Georgian Translation Model

Model Description

This is an English-to-Georgian neural machine translation model developed as part of a bachelor's thesis project. The model uses an encoder-decoder architecture with a pretrained BERT encoder and a randomly initialized decoder.

Architecture

  • Model Type: Encoder-Decoder
  • Encoder: Pretrained BERT model
  • Decoder: Randomly initialized with custom configuration
  • Decoder Tokenizer: RichNachos/georgian-corpus-tokenizer-test
  • Parameters: 266M total
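
For reference, a model of this shape can be assembled in transformers roughly as follows. This is a minimal sketch, not the project's build script: the exact BERT checkpoint and decoder hyperparameters are assumptions (the card only states "pretrained BERT" and "custom configuration"), and it assumes the tokenizer repo loads with AutoTokenizer.

from transformers import (
    AutoTokenizer, BertConfig, BertLMHeadModel, BertModel, EncoderDecoderModel
)

# Assumption: the card does not name the exact BERT checkpoint
encoder = BertModel.from_pretrained("bert-base-uncased")

# Decoder vocabulary follows the Georgian tokenizer named above
dec_tok = AutoTokenizer.from_pretrained("RichNachos/georgian-corpus-tokenizer-test")
decoder_config = BertConfig(
    vocab_size=dec_tok.vocab_size,
    is_decoder=True,           # causal (left-to-right) attention
    add_cross_attention=True,  # attend over encoder hidden states
)
decoder = BertLMHeadModel(decoder_config)  # weights are randomly initialized

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder_start_token_id = dec_tok.cls_token_id
model.config.pad_token_id = dec_tok.pad_token_id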

Training Details

  • Training Data: English-Georgian parallel corpus (see Darsala/english_georgian_corpora)
  • Training Duration: 16 epochs
  • Hardware: Nvidia A100 80GB
  • Batch Size: 128 with 2 gradient accumulation steps (effective batch size 256)
  • Scheduler: Cosine learning rate scheduler
  • Training Pipeline: Includes data cleaning, preprocessing, and augmentation stages
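
The settings above map naturally onto transformers' Seq2SeqTrainingArguments. Below is a sketch with the stated values filled in; the learning rate, warmup, and precision flags are illustrative assumptions, not values from this card.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=16,               # stated training duration
    per_device_train_batch_size=128,   # stated batch size
    gradient_accumulation_steps=2,     # effective batch size 256
    lr_scheduler_type="cosine",        # stated scheduler
    learning_rate=5e-5,                # assumption
    warmup_ratio=0.1,                  # assumption
    fp16=True,                         # assumption; plausible on an A100
    predict_with_generate=True,
)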

Performance

  • COMET Score: 0.79 (on FLORES test set)
  • Comparison: Google Translate (0.83) and Kona (0.84) on the same dataset
  • Translation Style: More literary and natural Georgian compared to Google Translate
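
To reproduce a COMET score on FLORES, the unbabel-comet package can be used along the following lines. The specific checkpoint (Unbabel/wmt22-comet-da) is an assumption; the card does not say which COMET model produced the 0.79.

from comet import download_model, load_from_checkpoint

ckpt_path = download_model("Unbabel/wmt22-comet-da")  # assumption: checkpoint not stated on the card
comet_model = load_from_checkpoint(ckpt_path)

# Each item pairs a source, a model translation, and a reference
data = [
    {"src": "Hello, how are you?", "mt": "<model output>", "ref": "<FLORES reference>"},
]
prediction = comet_model.predict(data, batch_size=8, gpus=1)
print(prediction.system_score)  # corpus-level score, comparable to the 0.79 above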

Usage

Important: This model uses a custom EncoderDecoderTokenizer that is included in the repository. You need to download the repo locally to access it.

import sys
from transformers import EncoderDecoderModel
import torch
import re
from huggingface_hub import snapshot_download

# Download the repo to a local folder
path_to_downloaded = snapshot_download(
    repo_id="Darsala/Georgian-Translation",
    local_dir="./Georgian-Translation",
)

# Add the downloaded folder to Python path so we can import the custom tokenizer
sys.path.append(path_to_downloaded)
from encoder_decoder_tokenizer import EncoderDecoderTokenizer

# Load the model and tokenizer from the downloaded folder
model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def translate(
    text: str,
    num_beams: int = 5,
    max_length: int = 256,
) -> str:
    """
    Translate a single string with the given EncoderDecoderModel.
    """
    # Normalize input: lowercase and collapse whitespace
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    
    # tokenize & move to device
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest"
    ).to(device)
    
    # generation
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True,
    )
    
    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"English: {text}")
    print(f"Translated: {output}")
    
    return output

# Example usage
translation = translate("Hello, how are you?")

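For translating many sentences, batching is faster than calling translate() in a loop. Below is a minimal sketch reusing the model, tokenizer, and device loaded above; translate_batch is a hypothetical helper, not part of the repository, and it assumes the custom tokenizer accepts a list of strings like a standard transformers tokenizer.

def translate_batch(texts, num_beams=5, max_length=256, batch_size=32):
    """Hypothetical helper: translate a list of strings in batches."""
    outputs = []
    for i in range(0, len(texts), batch_size):
        # Same normalization as translate()
        batch = [re.sub(r'\s+', ' ', t.lower()) for t in texts[i:i + batch_size]]
        inputs = tokenizer(
            batch, return_tensors="pt", truncation=True, padding="longest"
        ).to(device)
        generated = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            num_beams=num_beams,
            max_length=max_length,
            early_stopping=True,
        )
        outputs.extend(
            tokenizer.decode(g, skip_special_tokens=True) for g in generated
        )
    return outputs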

Strengths and Limitations

Strengths

  • Produces more literary and natural Georgian translations
  • Good performance on general text translation
  • Specialized for Georgian language characteristics

Limitations

  • Struggles with proper names and company names
  • Has difficulty with terms that must be copied verbatim from the English source
  • Limited by tokenizer coverage for certain English terms

Demo

Try the model in the interactive demo: Georgian Translation Space

Citation

@mastersthesis{darsalia2025georgian,
  title={English Translation Quality Assessment and Computer Translation},
  author={Luka Darsalia},
  year={2025},
  school={Tbilisi University},
  note={Bachelor's Thesis - Computer Science}
}

Related Resources

  • Training dataset: Darsala/english_georgian_corpora
  • Decoder tokenizer: RichNachos/georgian-corpus-tokenizer-test
  • Interactive demo: Georgian Translation Space