Turkish Address Classifier (State + City)
This is a multi-output model that predicts both state (province) and city (district) from Turkish street addresses.
Model Description
- Base Model: dbmdz/bert-base-turkish-uncased
- Task: Multi-output text classification
- Language: Turkish (tr)
- Outputs:
  - State (İl): 81 Turkish provinces
  - City (İlçe): 954 districts/municipalities
- Training Data: Turkish street addresses with state and city labels
Usage
Installation
pip install transformers torch huggingface_hub
Basic Prediction
from transformers import AutoTokenizer, BertPreTrainedModel, BertModel
from torch import nn
import torch
import pickle
from huggingface_hub import hf_hub_download
# Define the model architecture
class MultiOutputModel(BertPreTrainedModel):
    def __init__(self, config, num_state_labels, num_city_labels):
        super().__init__(config)
        self.bert = BertModel(config)
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.state_classifier = nn.Linear(config.hidden_size, num_state_labels)
        self.city_classifier = nn.Linear(config.hidden_size, num_city_labels)
        self.num_state_labels = num_state_labels
        self.num_city_labels = num_city_labels
        self.post_init()

    def forward(self, input_ids=None, attention_mask=None, state_labels=None, city_labels=None, **kwargs):
        outputs = self.bert(input_ids=input_ids, attention_mask=attention_mask)
        pooled_output = outputs[1]
        pooled_output = self.dropout(pooled_output)
        state_logits = self.state_classifier(pooled_output)
        city_logits = self.city_classifier(pooled_output)
        return {
            'state_logits': state_logits,
            'city_logits': city_logits
        }
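# Hub repository that holds the weights, tokenizer, and label encoders
model_path = "ucanbaklava/turkish-address-classifier_new"
# Load label encoders. These are pickle files, so only load them from a repository you trust.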
le_state_path = hf_hub_download(repo_id=model_path, filename="state_label_encoder.pkl")
le_city_path = hf_hub_download(repo_id=model_path, filename="city_label_encoder.pkl")
with open(le_state_path, "rb") as f:
    le_state = pickle.load(f)
with open(le_city_path, "rb") as f:
    le_city = pickle.load(f)
# Load tokenizer and model
tokenizer = AutoTokenizer.from_pretrained(model_path)
from transformers import AutoConfig
config = AutoConfig.from_pretrained(model_path)
# from_pretrained is a classmethod: call it on the class so the weights are
# loaded directly, without first building a randomly initialized copy.
model = MultiOutputModel.from_pretrained(
    model_path,
    config=config,
    num_state_labels=len(le_state.classes_),
    num_city_labels=len(le_city.classes_),
)
model.eval()
# Prediction function
def predict_address(text: str):
    inputs = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    state_logits = outputs['state_logits']
    city_logits = outputs['city_logits']
    state_probs = torch.nn.functional.softmax(state_logits, dim=-1)
    city_probs = torch.nn.functional.softmax(city_logits, dim=-1)
    state_conf, state_pred = torch.max(state_probs, dim=-1)
    city_conf, city_pred = torch.max(city_probs, dim=-1)
    predicted_state = le_state.inverse_transform([state_pred.item()])[0]
    predicted_city = le_city.inverse_transform([city_pred.item()])[0]
    return {
        'state': predicted_state,
        'state_confidence': state_conf.item(),
        'city': predicted_city,
        'city_confidence': city_conf.item()
    }
# Example usage
result = predict_address("atatürk caddesi no:5")
print(f"State: {result['state']} ({result['state_confidence']:.2%})")
print(f"City: {result['city']} ({result['city_confidence']:.2%})")
Example Output
>>> result = predict_address("bağdat caddesi")
>>> print(result)
{
    'state': 'istanbul',
    'state_confidence': 0.95,
    'city': 'kadıköy',
    'city_confidence': 0.92
}
Training Details
Hyperparameters
- Epochs: 10 (full training, no early stopping)
- Batch Size: 512
- Learning Rate: 2e-5
- Max Sequence Length: 128
- Optimizer: AdamW with weight decay 0.01
- Mixed Precision: BF16 + TF32 (A100 optimized)
- Hardware: NVIDIA A100-80GB GPU
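For a rough sense of training length, the batch size and epoch count above can be combined with the training-split size reported under Data Processing (1,033,830 addresses). The sketch below assumes one optimizer step per batch of 512, no gradient accumulation, and that the final partial batch is kept. The card states none of these, so treat the result as an estimate.

```python
import math

TRAIN_EXAMPLES = 1_033_830  # training split size from the Data Processing section
BATCH_SIZE = 512
EPOCHS = 10

# Assumption: one optimizer step per batch, last partial batch kept.
steps_per_epoch = math.ceil(TRAIN_EXAMPLES / BATCH_SIZE)
total_steps = steps_per_epoch * EPOCHS

print(steps_per_epoch)  # 2020
print(total_steps)      # 20200
```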
Data Processing
- Dataset Size: 1,148,700 Turkish addresses
- Train/Validation Split: 90/10 (1,033,830 / 114,870 addresses)
- Turkish-aware lowercasing (maps İ→i and I→ı, which plain lowercasing gets wrong)
- Street type transformations: (Sokak)→sokak, (Cadde)→caddesi, (Bulvar)→bulvarı
- Removed (küme evler) patterns
- Removed empty/missing values
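The cleaning steps listed above could look roughly like the following. This is a minimal sketch, not the author's preprocessing code: the function names `turkish_lower` and `normalize_address` are hypothetical, and it assumes the raw data contains literal parenthesized tags such as "(Sokak)" and "(Küme Evler)", which is one reading of the list above.

```python
import re

# Python's str.lower() maps "I" to "i" and "İ" to "i" plus a combining dot,
# so the two Turkish-specific capitals are translated first.
_TR_CAPITALS = str.maketrans({"I": "ı", "İ": "i"})

# Assumed tag spellings (already lowercased); the real data may differ.
_STREET_TYPES = {
    "(sokak)": "sokak",
    "(cadde)": "caddesi",
    "(bulvar)": "bulvarı",
}


def turkish_lower(text: str) -> str:
    """Lowercase text using Turkish casing rules for I and İ."""
    return text.translate(_TR_CAPITALS).lower()


def normalize_address(raw: str) -> str:
    """Lowercase, rewrite street-type tags, and drop '(küme evler)'."""
    text = turkish_lower(raw)
    for tag, replacement in _STREET_TYPES.items():
        text = text.replace(tag, replacement)
    text = text.replace("(küme evler)", " ")
    # Collapse the whitespace left behind by the removals.
    return re.sub(r"\s+", " ", text).strip()


print(turkish_lower("ISPARTA İSTANBUL"))          # ısparta istanbul
print(normalize_address("Atatürk (Cadde) No:5"))  # atatürk caddesi no:5
```

Applying the same normalization to an address before calling `predict_address` keeps inference inputs close to what the model saw during training, if the training pipeline did work this way.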
Model Architecture
The model uses BERT as a backbone with two separate classification heads:
- State Classifier: Predicts one of 81 Turkish provinces
- City Classifier: Predicts one of 954 districts/municipalities
Both classifiers share the same BERT encoder, so a single forward pass through the encoder yields both predictions and the two tasks learn from one shared representation of the address.
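To put the cost of the two heads in perspective, their parameter counts can be worked out from the layer shapes. The sketch below assumes a hidden size of 768, the standard width for BERT-base models; the actual value is whatever `config.hidden_size` reports for this checkpoint.

```python
HIDDEN_SIZE = 768   # assumption: standard BERT-base width; check config.hidden_size
NUM_STATES = 81     # provinces
NUM_CITIES = 954    # districts


def linear_params(in_features: int, out_features: int) -> int:
    """Weights plus biases of one fully connected layer."""
    return in_features * out_features + out_features


state_head = linear_params(HIDDEN_SIZE, NUM_STATES)
city_head = linear_params(HIDDEN_SIZE, NUM_CITIES)

print(state_head, city_head, state_head + city_head)  # 62289 733626 795915
```

Under that assumption the two heads together add fewer than a million parameters, which is small next to a BERT-base encoder on the order of 110 million.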
Limitations
- Trained specifically on Turkish addresses
- May not generalize well to addresses with very different formatting
- Performance depends on the quality and coverage of training data
- Some rare city/state combinations may have lower accuracy
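Because the two heads predict independently, nothing in the architecture forces the predicted district to belong to the predicted province. One way to catch such cases is a post-hoc check on the dictionary returned by `predict_address`. The sketch below is illustrative only: the lookup table `DISTRICTS_BY_PROVINCE`, the function `review_prediction`, and the 0.5 threshold are all hypothetical and not part of the repository.

```python
# Hypothetical lookup table; a real one would cover all 81 provinces.
DISTRICTS_BY_PROVINCE = {
    "istanbul": {"kadıköy", "üsküdar", "beşiktaş"},
    "ankara": {"çankaya", "keçiören"},
}


def review_prediction(result: dict, min_confidence: float = 0.5) -> list:
    """Return a list of reasons to distrust a predict_address() result."""
    issues = []
    if result["state_confidence"] < min_confidence:
        issues.append("low state confidence")
    if result["city_confidence"] < min_confidence:
        issues.append("low city confidence")
    known = DISTRICTS_BY_PROVINCE.get(result["state"])
    # Only flag a mismatch when the province is present in the table.
    if known is not None and result["city"] not in known:
        issues.append("city not listed under predicted state")
    return issues


example = {
    "state": "istanbul",
    "state_confidence": 0.95,
    "city": "kadıköy",
    "city_confidence": 0.92,
}
print(review_prediction(example))  # []
```

An empty list means the prediction passed every check; anything else is a candidate for manual review or a fallback rule.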
Citation
If you use this model, please cite:
@misc{turkish-address-classifier,
  author = {ucanbaklava},
  title = {Turkish Address Classifier (State + City)},
  year = {2025},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/ucanbaklava/turkish-address-classifier_new}}
}
Model Card Authors
ucanbaklava
Model Card Contact
For questions or feedback, please open an issue on the model repository.