Turkish Address Classifier (State + City)

This multi-output model predicts both the state (province, il) and the city (district, ilçe) of a Turkish street address.

Model Description

  • Base Model: dbmdz/bert-base-turkish-uncased
  • Task: Multi-output text classification
  • Language: Turkish (tr)
  • Outputs:
    • State (İl): 81 Turkish provinces
    • City (İlçe): 954 districts/municipalities
  • Training Data: Turkish street addresses with state and city labels

Usage

Installation

pip install transformers torch huggingface_hub scikit-learn

Basic Prediction

from transformers import AutoTokenizer, BertPreTrainedModel, BertModel
from torch import nn
import torch
import pickle
from huggingface_hub import hf_hub_download

# Define the model architecture
class MultiOutputModel(BertPreTrainedModel):
    def __init__(self, config, num_state_labels, num_city_labels):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        
        self.state_classifier = nn.Linear(config.hidden_size, num_state_labels)
        self.city_classifier = nn.Linear(config.hidden_size, num_city_labels)
        
        self.num_state_labels = num_state_labels
        self.num_city_labels = num_city_labels
        
        self.post_init()
    
    def forward(self, input_ids=None, attention_mask=None, state_labels=None, city_labels=None, **kwargs):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[1]  # pooled [CLS] representation (pooler_output)
        pooled_output = self.dropout(pooled_output)
        
        state_logits = self.state_classifier(pooled_output)
        city_logits = self.city_classifier(pooled_output)
        
        return {
            'state_logits': state_logits,
            'city_logits': city_logits
        }

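# Hub repository that holds the weights, tokenizer, and label encoders
model_path = "ucanbaklava/turkish-address-classifier_new"

# Download the pickled label encoders. Unpickling needs the library that
# created them (scikit-learn, judging by the classes_/inverse_transform API)
# and can run arbitrary code, so only load pickles from a source you trust.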
le_state_path = hf_hub_download(repo_id=model_path, filename="state_label_encoder.pkl")
le_city_path = hf_hub_download(repo_id=model_path, filename="city_label_encoder.pkl")

with open(le_state_path, "rb") as f:
    le_state = pickle.load(f)
with open(le_city_path, "rb") as f:
    le_city = pickle.load(f)

# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)

# from_pretrained is a classmethod: call it on the class so the checkpoint
# weights are loaded once, without first building a randomly initialised model.
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_path)
model = MultiOutputModel.from_pretrained(
    model_path,
    config=config,
    num_state_labels=len(le_state.classes_),
    num_city_labels=len(le_city.classes_),
)

model.eval()

# Prediction function
def predict_address(text: str):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    
    with torch.no_grad():
        outputs = model(**inputs)
        state_logits = outputs['state_logits']
        city_logits = outputs['city_logits']
        
        state_probs = torch.nn.functional.softmax(state_logits, dim=-1)
        city_probs = torch.nn.functional.softmax(city_logits, dim=-1)
        
        state_conf, state_pred = torch.max(state_probs, dim=-1)
        city_conf, city_pred = torch.max(city_probs, dim=-1)
    
    predicted_state = le_state.inverse_transform([state_pred.item()])[0]
    predicted_city = le_city.inverse_transform([city_pred.item()])[0]
    
    return {
        'state': predicted_state,
        'state_confidence': state_conf.item(),
        'city': predicted_city,
        'city_confidence': city_conf.item()
    }

# Example usage
result = predict_address("atatürk caddesi no:5")
print(f"State: {result['state']} ({result['state_confidence']:.2%})")
print(f"City: {result['city']} ({result['city_confidence']:.2%})")

Example Output

>>> result = predict_address("bağdat caddesi")
>>> print(result)
{
    'state': 'istanbul',
    'state_confidence': 0.95,
    'city': 'kadıköy',
    'city_confidence': 0.92
}
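The top prediction alone can hide a close runner-up, which matters with 954 city classes. The following is a minimal sketch of ranking the k most probable labels from one row of probabilities. The helper name `top_k_labels` and the toy values are hypothetical and not part of the published model code.

```python
import heapq
from typing import List, Sequence, Tuple

def top_k_labels(probs: Sequence[float], classes: Sequence[str], k: int = 3) -> List[Tuple[str, float]]:
    """Return the k most probable (label, probability) pairs, best first."""
    if len(probs) != len(classes):
        raise ValueError("probs and classes must have the same length")
    # nlargest returns the pairs sorted by descending probability.
    best = heapq.nlargest(k, zip(probs, classes), key=lambda pair: pair[0])
    return [(label, float(p)) for p, label in best]

# Toy probabilities standing in for one row of state_probs (made-up values).
demo = top_k_labels([0.05, 0.70, 0.25], ["ankara", "istanbul", "izmir"], k=2)
print(demo)
```

To use it with the model, one option is to call `top_k_labels(state_probs[0].tolist(), list(le_state.classes_))` inside `predict_address`, and likewise for the city head.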

Training Details

Hyperparameters

  • Epochs: 10 (full training, no early stopping)
  • Batch Size: 512
  • Learning Rate: 2e-5
  • Max Sequence Length: 128
  • Optimizer: AdamW with weight decay 0.01
  • Mixed Precision: BF16 + TF32 (A100 optimized)
  • Hardware: NVIDIA A100-80GB GPU
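For a sense of scale, the hyperparameters above can be turned into an optimizer step count. This is a rough sketch that assumes a single device, no gradient accumulation, and a kept final partial batch, none of which the card states. The training-set size is the 1,033,830 figure from the Data Processing section, and `TrainConfig` is a hypothetical container, not the author's training script.

```python
import math
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class TrainConfig:
    # Values copied from the hyperparameter list in this card.
    epochs: int = 10
    batch_size: int = 512
    learning_rate: float = 2e-5
    max_seq_length: int = 128
    weight_decay: float = 0.01

def optimizer_steps(num_train_examples: int, cfg: TrainConfig) -> Tuple[int, int]:
    """Return (steps per epoch, total steps), keeping the last partial batch."""
    steps_per_epoch = math.ceil(num_train_examples / cfg.batch_size)
    return steps_per_epoch, steps_per_epoch * cfg.epochs

cfg = TrainConfig()
steps_per_epoch, total_steps = optimizer_steps(1_033_830, cfg)
print(steps_per_epoch, total_steps)  # 2020 20200
```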

Data Processing

  • Dataset Size: 1,148,700 Turkish addresses
  • Train/Validation Split: 90/10 (1,033,830 / 114,870 addresses)
  • Turkish-aware lowercasing (maps İ→i and I→ı instead of Python's default I→i)
  • Street type transformations: (Sokak)→sokak, (Cadde)→caddesi, (Bulvar)→bulvarı
  • Removed (küme evler) patterns
  • Removed empty/missing values
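Input at inference time should be normalised the same way as the training data. The sketch below illustrates the steps listed above under stated assumptions: it treats "(Sokak)", "(Cadde)", "(Bulvar)" and "(küme evler)" as literal parenthesised markers in the raw text, and the function names are hypothetical. The author's actual preprocessing code is not shown in this card and may differ.

```python
import re

def turkish_lower(text: str) -> str:
    """Lowercase with Turkish casing rules for the dotted and dotless I."""
    # Handle the two special capitals first: plain str.lower() turns "I" into
    # "i" and "İ" into "i" followed by a combining dot.
    return text.replace("İ", "i").replace("I", "ı").lower()

# Assumed marker forms, applied after lowercasing.
_STREET_TYPES = {"(sokak)": "sokak", "(cadde)": "caddesi", "(bulvar)": "bulvarı"}
_KUME_EVLER = re.compile(r"\(küme evler\)")

def normalize_address(text: str) -> str:
    text = turkish_lower(text)
    for marker, replacement in _STREET_TYPES.items():
        text = text.replace(marker, replacement)
    text = _KUME_EVLER.sub(" ", text)
    # Collapse the whitespace left behind by removed patterns.
    return " ".join(text.split())

print(normalize_address("ATATÜRK (Cadde) IŞIK (Küme Evler) No:5"))
```

A cleaned string can then be passed to `predict_address` in place of the raw input.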

Model Architecture

The model uses BERT as a backbone with two separate classification heads:

  1. State Classifier: Predicts one of 81 Turkish provinces
  2. City Classifier: Predicts one of 954 districts/municipalities

Both classifiers share one BERT encoder, so a single forward pass yields both predictions and the two tasks learn a shared representation of Turkish addresses.
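To make the cost of the second head concrete, the parameter count of the two linear heads can be worked out directly. The sketch assumes a hidden size of 768, the standard value for BERT-base models; `config.hidden_size` gives the actual value for this checkpoint.

```python
def linear_params(in_features: int, out_features: int) -> int:
    """Parameters of nn.Linear: a weight matrix plus one bias per output."""
    return in_features * out_features + out_features

HIDDEN_SIZE = 768  # assumed BERT-base hidden size; check config.hidden_size

state_head = linear_params(HIDDEN_SIZE, 81)   # 81 provinces
city_head = linear_params(HIDDEN_SIZE, 954)   # 954 districts
print(state_head, city_head, state_head + city_head)  # 62289 733626 795915
```

Under that assumption the two heads together add fewer than 0.8 million parameters, a small fraction of an encoder with on the order of 100 million.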

Limitations

  • Trained specifically on Turkish addresses
  • May not generalize well to addresses with very different formatting
  • Performance depends on the quality and coverage of training data
  • Some rare city/state combinations may have lower accuracy
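Because the two heads predict independently, nothing in the model forces the predicted city to lie in the predicted state. One possible safeguard is a lookup that flags inconsistent pairs for review. The mapping below is a tiny hypothetical excerpt for illustration. A real one would have to be built from a full province-to-district table, which this card does not provide.

```python
from typing import Dict, Set

# Hypothetical excerpt; keys and values use the lowercased label format.
STATE_TO_CITIES: Dict[str, Set[str]] = {
    "istanbul": {"kadıköy", "beşiktaş", "üsküdar"},
    "ankara": {"çankaya", "keçiören"},
}

def is_consistent(state: str, city: str, mapping: Dict[str, Set[str]] = STATE_TO_CITIES) -> bool:
    """True if the city is a known district of the state in the mapping."""
    return city in mapping.get(state, set())

print(is_consistent("istanbul", "kadıköy"))  # True
print(is_consistent("ankara", "kadıköy"))    # False
```

With the prediction function from the Usage section, the check would be `is_consistent(result['state'], result['city'])`.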

Citation

If you use this model, please cite:

@misc{turkish-address-classifier,
  author = {ucanbaklava},
  title = {Turkish Address Classifier (State + City)},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ucanbaklava/turkish-address-classifier_new}}
}

Model Card Authors

ucanbaklava

Model Card Contact

For questions or feedback, please open an issue on the model repository.
