Turkish Address Classifier (State + City)

This is a multi-output model that predicts both state (province) and city (district) from Turkish street addresses.

Model Description

Base Model: dbmdz/bert-base-turkish-uncased
Task: Multi-output text classification
Language: Turkish (tr)
Outputs:
- State (İl): 81 Turkish provinces
- City (İlçe): 954 districts/municipalities
Training Data: Turkish street addresses with state and city labels

Usage

Installation

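pip install transformers torch huggingface_hub scikit-learn

scikit-learn is included on the assumption that the pickled label encoders are scikit-learn LabelEncoder objects, as their classes_ and inverse_transform interface suggests.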

Basic Prediction

from transformers import AutoTokenizer, BertPreTrainedModel, BertModel
from torch import nn
import torch
import pickle
from huggingface_hub import hf_hub_download

# Define the model architecture
class MultiOutputModel(BertPreTrainedModel):
    def __init__(self, config, num_state_labels, num_city_labels):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
        self.state_classifier = nn.Linear(config.hidden_size, num_state_labels)
        self.city_classifier = nn.Linear(config.hidden_size, num_city_labels)
        
        self.num_state_labels = num_state_labels
        self.num_city_labels = num_city_labels
        
        self.post_init()
    
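    def forward(self, input_ids=None, attention_mask=None, state_labels=None, city_labels=None, **kwargs):
        # state_labels and city_labels are accepted but unused in this
        # inference-only forward pass; no loss is computed.
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[1]  # pooler_output: transformed [CLS] hidden state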
        pooled_output = self.dropout(pooled_output)
        
        state_logits = self.state_classifier(pooled_output)
        city_logits = self.city_classifier(pooled_output)
        
        return {
            'state_logits': state_logits,
            'city_logits': city_logits
        }

# Load model and label encoders
model_path = "ucanbaklava/turkish-address-classifier_new"

# Load label encoders.
# Note: unpickling can execute arbitrary code, so only load files from a repository you trust.
le_state_path = hf_hub_download(repo_id=model_path, filename="state_label_encoder.pkl")
le_city_path = hf_hub_download(repo_id=model_path, filename="city_label_encoder.pkl")

with open(le_state_path, "rb") as f:
    le_state = pickle.load(f)
with open(le_city_path, "rb") as f:
    le_city = pickle.load(f)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)

from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_path)

# from_pretrained is a classmethod: call it on the class so the model is built
# once. The extra keyword arguments are forwarded to MultiOutputModel.__init__.
model = MultiOutputModel.from_pretrained(
    model_path,
    config=config,
    num_state_labels=len(le_state.classes_),
    num_city_labels=len(le_city.classes_),
)

model.eval()

# Prediction function
def predict_address(text: str):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    
    with torch.no_grad():
        outputs = model(**inputs)
        state_logits = outputs['state_logits']
        city_logits = outputs['city_logits']
        
        state_probs = torch.nn.functional.softmax(state_logits, dim=-1)
        city_probs = torch.nn.functional.softmax(city_logits, dim=-1)
        
        state_conf, state_pred = torch.max(state_probs, dim=-1)
        city_conf, city_pred = torch.max(city_probs, dim=-1)
    
    predicted_state = le_state.inverse_transform([state_pred.item()])[0]
    predicted_city = le_city.inverse_transform([city_pred.item()])[0]
    
    return {
        'state': predicted_state,
        'state_confidence': state_conf.item(),
        'city': predicted_city,
        'city_confidence': city_conf.item()
    }

# Example usage
result = predict_address("atatürk caddesi no:5")
print(f"State: {result['state']} ({result['state_confidence']:.2%})")
print(f"City: {result['city']} ({result['city_confidence']:.2%})")

Example Output (illustrative, reformatted for readability)

>>> result = predict_address("bağdat caddesi")
>>> print(result)
{
    'state': 'istanbul',
    'state_confidence': 0.95,
    'city': 'kadıköy',
    'city_confidence': 0.92
}
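
Because predict_address returns a confidence for each head, a caller may want to discard low-confidence labels rather than trust them blindly. The sketch below shows one way to do that. The accept_prediction helper and the 0.80 threshold are illustrative assumptions, not part of the released model, and softmax confidences are not guaranteed to be well calibrated.

```python
def accept_prediction(result: dict, min_confidence: float = 0.80) -> dict:
    """Replace low-confidence labels with None.

    `result` is a dict shaped like the return value of predict_address.
    `min_confidence` is a hypothetical threshold; tune it on held-out data.
    """
    filtered = dict(result)  # shallow copy so the input is not mutated
    for field in ("state", "city"):
        if result[f"{field}_confidence"] < min_confidence:
            filtered[field] = None
    return filtered


# Hand-written input for illustration, not real model output.
example = {
    "state": "istanbul",
    "state_confidence": 0.95,
    "city": "kadıköy",
    "city_confidence": 0.55,
}
print(accept_prediction(example))
```

With these made-up numbers the state label is kept and the city label becomes None, since 0.55 is below the assumed threshold.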

Training Details

Hyperparameters

Epochs: 10 (full training, no early stopping)
Batch Size: 512
Learning Rate: 2e-5
Max Sequence Length: 128
Optimizer: AdamW with weight decay 0.01
Mixed Precision: BF16 + TF32 (A100 optimized)
Hardware: NVIDIA A100-80GB GPU
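
For a sense of scale, the batch size and the training-split size reported in this card determine the number of optimizer steps. The arithmetic below assumes a single GPU, no gradient accumulation, and that the last partial batch is kept; the card states none of these.

```python
import math

train_examples = 1_033_830  # training split size reported under Data Processing
batch_size = 512
epochs = 10

# Assumption: one optimizer step per batch, last partial batch kept.
steps_per_epoch = math.ceil(train_examples / batch_size)
total_steps = steps_per_epoch * epochs

print(steps_per_epoch)  # 2020
print(total_steps)      # 20200
```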

Data Processing

Dataset Size: 1,148,700 Turkish addresses
Train/Validation Split: 90/10 (1,033,830 / 114,870 addresses)
Turkish-aware lowercasing (maps İ→i and I→ı, which default lowercasing gets wrong)
Street type transformations: (Sokak)→sokak, (Cadde)→caddesi, (Bulvar)→bulvarı
Removed (küme evler) patterns
Removed empty/missing values
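
The cleaning steps listed above could be sketched roughly as follows. This is not the author's preprocessing code: the function names, the regular expressions, and the order of the steps are assumptions, and the raw data is assumed to mark street types with literal parenthesised tokens such as (Sokak).

```python
import re

# Street-type tokens and their replacements, as listed in this card.
STREET_TYPES = {
    "(Sokak)": "sokak",
    "(Cadde)": "caddesi",
    "(Bulvar)": "bulvarı",
}


def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules: İ -> i and I -> ı."""
    # str.lower() alone maps "I" to "i" and "İ" to "i" plus a combining dot,
    # so both capitals are replaced explicitly before calling it.
    return text.replace("İ", "i").replace("I", "ı").lower()


def clean_address(raw):
    """Return a normalised address, or None for empty/missing values."""
    if raw is None or not raw.strip():
        return None
    text = raw
    for token, replacement in STREET_TYPES.items():
        text = text.replace(token, replacement)
    # Drop "(küme evler)" markers; case-insensitive matching is an assumption.
    text = re.sub(r"\(küme evler\)", " ", text, flags=re.IGNORECASE)
    text = turkish_lower(text)
    # Collapse whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip() or None


print(clean_address("Atatürk (Cadde) No:5 IŞIKLAR"))
# atatürk caddesi no:5 ışıklar
```

The explicit replacements matter because Python's built-in lowercasing follows the default Unicode rules rather than Turkish ones.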

Model Architecture

The model uses BERT as a backbone with two separate classification heads:

State Classifier: Predicts one of 81 Turkish provinces
City Classifier: Predicts one of 954 districts/municipalities

Both classifiers share the same BERT encoder, making the model efficient and allowing it to learn shared representations for Turkish addresses.
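
To put the efficiency claim in numbers, the two heads add very few parameters on top of the shared encoder. The calculation below assumes a hidden size of 768, the usual width for BERT-base models; the authoritative value is config.hidden_size.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Parameter count of a linear layer with bias: weights plus biases."""
    return in_features * out_features + out_features


hidden_size = 768  # assumption: standard BERT-base width; check config.hidden_size
state_head = linear_params(hidden_size, 81)   # 81 provinces
city_head = linear_params(hidden_size, 954)   # 954 districts

print(state_head)              # 62289
print(city_head)               # 733626
print(state_head + city_head)  # 795915
```

Under that assumption the heads total about 0.8M parameters, under 1% of the roughly 0.1B parameters reported for the whole model.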

Limitations

Trained specifically on Turkish addresses
May not generalize well to addresses with very different formatting
Performance depends on the quality and coverage of training data
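Rare states and cities may have lower accuracy, and because the two heads predict independently, the predicted city is not guaranteed to belong to the predicted state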

Citation

If you use this model, please cite:

@misc{turkish-address-classifier,
  author = {ucanbaklava},
  title = {Turkish Address Classifier (State + City)},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ucanbaklava/turkish-address-classifier_new}}
}

Model Card Authors

ucanbaklava

Model Card Contact

For questions or feedback, please open an issue on the model repository.

Model size: 0.1B params
Tensor type: F32
Weights format: Safetensors