Georgian Translation Model
Model Description
This is an English-to-Georgian neural machine translation model developed as part of a bachelor's thesis project. The model uses an encoder-decoder architecture that pairs a pretrained BERT encoder with a randomly initialized decoder.
Architecture
- Model Type: Encoder-Decoder
- Encoder: Pretrained BERT model (google-bert/bert-base-uncased)
- Decoder: Randomly initialized with custom configuration
- Decoder Tokenizer: RichNachos/georgian-corpus-tokenizer-test
- Parameters: 266M total
Training Details
- Training Data: English-Georgian parallel corpus (see Darsala/english_georgian_corpora)
- Training Duration: 16 epochs
- Hardware: Nvidia A100 80GB
- Batch Size: 128 with 2 gradient accumulation steps
- Scheduler: Cosine learning rate scheduler
- Training Pipeline: Complete data cleaning, preprocessing, and augmentation pipeline
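To illustrate the cosine learning rate scheduler listed above, here is a minimal, library-free sketch of how such a schedule is commonly computed. The base learning rate, step counts, and optional linear warmup are assumptions made for illustration only. The model card does not state the values used in training, and this is not necessarily the exact scheduler implementation the author used.

```python
import math


def cosine_lr(
    step: int,
    total_steps: int,
    base_lr: float,
    warmup_steps: int = 0,
    min_lr: float = 0.0,
) -> float:
    """Learning rate at `step` under a cosine decay with optional linear warmup.

    All arguments are hypothetical; the model card gives no learning-rate values.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")

    # Linear warmup from 0 up to base_lr (assumed, not stated in the card).
    if warmup_steps > 0 and step < warmup_steps:
        return base_lr * step / warmup_steps

    # Fraction of the decay phase completed, clamped to [0, 1].
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, max(0.0, (step - warmup_steps) / decay_steps))

    # Half a cosine period: base_lr at progress=0, min_lr at progress=1.
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    # Print a few points of an illustrative 100-step schedule.
    for s in (0, 25, 50, 75, 100):
        print(s, cosine_lr(s, total_steps=100, base_lr=1e-3))
```

The rate starts at the base value, falls slowly at first, fastest around the midpoint, and flattens out again as it approaches the minimum.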
Performance
- COMET Score: 0.79 (on FLORES test set)
- Comparison: Google Translate (0.83) and Kona (0.84) on the same dataset
- Translation Style: More literary and natural Georgian compared to Google Translate
Usage
Important: This model uses a custom EncoderDecoderTokenizer that is included in the repository. You need to download the repo locally to access it.
import re
import sys

import torch
from huggingface_hub import snapshot_download
from transformers import EncoderDecoderModel

# Download the repo to a local folder
path_to_downloaded = snapshot_download(
    repo_id="Darsala/Georgian-Translation",
    local_dir="./Georgian-Translation",
    local_dir_use_symlinks=False
)

# Add the downloaded folder to the Python path so we can import the custom tokenizer
sys.path.append(path_to_downloaded)
from encoder_decoder_tokenizer import EncoderDecoderTokenizer

# Load the model and tokenizer from the downloaded folder
model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)


def translate(
    text: str,
    num_beams: int = 5,
    max_length: int = 256,
) -> str:
    """
    Translate a single string with the given EncoderDecoderModel.
    """
    # Lowercase the input and collapse runs of whitespace into single spaces
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)

    # Tokenize and move the tensors to the model's device
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest"
    ).to(device)

    # Generate the translation with beam search
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True,
    )

    # Decode the best hypothesis back to text
    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"English: {text}")
    print(f"Translated: {output}")
    return output


# Example usage
translation = translate("Hello, how are you?")
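The translate function above lowercases its input, collapses whitespace, and tokenizes with truncation=True, so very long inputs may be cut off. One possible workaround is to split longer text into sentences and translate them one at a time. The sketch below is an assumption-laden illustration, not part of the repository. The helper names normalize_source and split_sentences are hypothetical, the splitter is deliberately naive (it will mis-split abbreviations such as "Dr."), and normalize_source additionally trims leading and trailing whitespace, which the original function does not do.

```python
import re
from typing import List


def normalize_source(text: str) -> str:
    """Lowercase and collapse whitespace, mirroring the first steps of translate().

    The final strip() is an addition that translate() itself does not perform.
    """
    return re.sub(r"\s+", " ", text.lower()).strip()


def split_sentences(text: str) -> List[str]:
    """Naively split text after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


if __name__ == "__main__":
    paragraph = "Hello there.   How are you?\nI am fine!"
    for sentence in split_sentences(paragraph):
        # In real use, each sentence would be passed to translate(sentence).
        print(normalize_source(sentence))
```

Translating sentence by sentence discards cross-sentence context, so this trade-off may or may not suit a given text.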
Strengths and Limitations
Strengths
- Produces more literary and natural Georgian translations
- Good performance on general text translation
- Specialized for Georgian language characteristics
Limitations
- Struggles with proper names and company names
- Has trouble with terms that should be copied verbatim from the English source
- Limited by tokenizer coverage for certain English terms
Demo
Try the model in the interactive demo: Georgian Translation Space
Citation
@mastersthesis{darsalia2025georgian,
  title={English Translation Quality Assessment and Computer Translation},
  author={Luka Darsalia},
  year={2025},
  school={Tbilisi University},
  note={Bachelor's Thesis - Computer Science}
}
Related Resources
- Training Data: english_georgian_corpora
- Georgian COMET Model: georgian_comet
- Evaluation Data: georgian_metric_evaluation
Model tree for Darsala/Georgian-Translation
Base model
google-bert/bert-base-uncased