Georgian Translation Model

Model Description

This is an English-to-Georgian neural machine translation model developed as part of a bachelor's thesis project. The model uses an encoder-decoder architecture with a pretrained BERT encoder and a randomly initialized decoder.

Architecture

  • Model Type: Encoder-Decoder
  • Encoder: Pretrained BERT model
  • Decoder: Randomly initialized with custom configuration
  • Decoder Tokenizer: RichNachos/georgian-corpus-tokenizer-test
  • Parameters: 266M total
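
For reference, a model of this shape can be assembled in transformers roughly as follows. This is a minimal sketch, not the project's build script: the exact BERT checkpoint and decoder hyperparameters are assumptions (the card only states "pretrained BERT" and "custom configuration"), and it assumes the tokenizer repo loads with AutoTokenizer.

from transformers import (
    AutoTokenizer, BertConfig, BertLMHeadModel, BertModel, EncoderDecoderModel
)

# Assumption: the card does not name the exact BERT checkpoint
encoder = BertModel.from_pretrained("bert-base-uncased")

# Decoder vocabulary follows the Georgian tokenizer named above
dec_tok = AutoTokenizer.from_pretrained("RichNachos/georgian-corpus-tokenizer-test")
decoder_config = BertConfig(
    vocab_size=dec_tok.vocab_size,
    is_decoder=True,           # causal (left-to-right) attention
    add_cross_attention=True,  # attend over encoder hidden states
)
decoder = BertLMHeadModel(decoder_config)  # weights are randomly initialized

model = EncoderDecoderModel(encoder=encoder, decoder=decoder)
model.config.decoder_start_token_id = dec_tok.cls_token_id
model.config.pad_token_id = dec_tok.pad_token_id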

Training Details

  • Training Data: English-Georgian parallel corpus (see Darsala/english_georgian_corpora)
  • Training Duration: 16 epochs
  • Hardware: Nvidia A100 80GB
  • Batch Size: 128 with 2 gradient accumulation steps (effective batch size 256)
  • Scheduler: Cosine learning rate scheduler
  • Training Pipeline: Includes data cleaning, preprocessing, and augmentation stages
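
The settings above map naturally onto transformers' Seq2SeqTrainingArguments. Below is a sketch with the stated values filled in; the learning rate, warmup, and precision flags are illustrative assumptions, not values from this card.

from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="./checkpoints",
    num_train_epochs=16,               # stated training duration
    per_device_train_batch_size=128,   # stated batch size
    gradient_accumulation_steps=2,     # effective batch size 256
    lr_scheduler_type="cosine",        # stated scheduler
    learning_rate=5e-5,                # assumption
    warmup_ratio=0.1,                  # assumption
    fp16=True,                         # assumption; plausible on an A100
    predict_with_generate=True,
)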

Performance

  • COMET Score: 0.79 (on FLORES test set)
  • Comparison: Google Translate (0.83) and Kona (0.84) on the same dataset
  • Translation Style: More literary and natural Georgian compared to Google Translate
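
To reproduce a COMET score on FLORES, the unbabel-comet package can be used along the following lines. The specific checkpoint (Unbabel/wmt22-comet-da) is an assumption; the card does not say which COMET model produced the 0.79.

from comet import download_model, load_from_checkpoint

ckpt_path = download_model("Unbabel/wmt22-comet-da")  # assumption: checkpoint not stated on the card
comet_model = load_from_checkpoint(ckpt_path)

# Each item pairs a source, a model translation, and a reference
data = [
    {"src": "Hello, how are you?", "mt": "<model output>", "ref": "<FLORES reference>"},
]
prediction = comet_model.predict(data, batch_size=8, gpus=1)
print(prediction.system_score)  # corpus-level score, comparable to the 0.79 above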

Usage

Important: This model uses a custom EncoderDecoderTokenizer that is included in the repository. You need to download the repo locally to access it.

import sys
from transformers import EncoderDecoderModel
import torch
import re
from huggingface_hub import snapshot_download

# Download the repo to a local folder
path_to_downloaded = snapshot_download(
    repo_id="Darsala/Georgian-Translation",
    local_dir="./Georgian-Translation",
)

# Add the downloaded folder to Python path so we can import the custom tokenizer
sys.path.append(path_to_downloaded)
from encoder_decoder_tokenizer import EncoderDecoderTokenizer

# Load the model and tokenizer from the downloaded folder
model = EncoderDecoderModel.from_pretrained(path_to_downloaded)
tokenizer = EncoderDecoderTokenizer.from_pretrained(path_to_downloaded)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

def translate(
    text: str,
    num_beams: int = 5,
    max_length: int = 256,
) -> str:
    """
    Translate a single string with the given EncoderDecoderModel.
    """
    # Normalize input: lowercase and collapse whitespace
    text = text.lower()
    text = re.sub(r'\s+', ' ', text)
    
    # tokenize & move to device
    inputs = tokenizer(
        text,
        return_tensors="pt",
        truncation=True,
        padding="longest"
    ).to(device)
    
    # generation
    generated_ids = model.generate(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        num_beams=num_beams,
        max_length=max_length,
        early_stopping=True,
    )
    
    output = tokenizer.decode(generated_ids[0], skip_special_tokens=True)
    print(f"English: {text}")
    print(f"Translated: {output}")
    
    return output

# Example usage
translation = translate("Hello, how are you?")

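For translating many sentences, batching is faster than calling translate() in a loop. Below is a minimal sketch reusing the model, tokenizer, and device loaded above; translate_batch is a hypothetical helper, not part of the repository, and it assumes the custom tokenizer accepts a list of strings like a standard transformers tokenizer.

def translate_batch(texts, num_beams=5, max_length=256, batch_size=32):
    """Hypothetical helper: translate a list of strings in batches."""
    outputs = []
    for i in range(0, len(texts), batch_size):
        # Same normalization as translate()
        batch = [re.sub(r'\s+', ' ', t.lower()) for t in texts[i:i + batch_size]]
        inputs = tokenizer(
            batch, return_tensors="pt", truncation=True, padding="longest"
        ).to(device)
        generated = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            num_beams=num_beams,
            max_length=max_length,
            early_stopping=True,
        )
        outputs.extend(
            tokenizer.decode(g, skip_special_tokens=True) for g in generated
        )
    return outputs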

Strengths and Limitations

Strengths

  • Produces more literary and natural Georgian translations
  • Good performance on general text translation
  • Specialized for Georgian language characteristics

Limitations

  • Struggles with proper names and company names
  • Has difficulty with terms that must be copied verbatim from the English source
  • Limited by tokenizer coverage for certain English terms

Demo

Try the model in the interactive demo: Georgian Translation Space

Citation

@mastersthesis{darsalia2025georgian,
  title={English Translation Quality Assessment and Computer Translation},
  author={Luka Darsalia},
  year={2025},
  school={Tbilisi University},
  note={Bachelor's Thesis - Computer Science}
}

Related Resources

  • Training dataset: Darsala/english_georgian_corpora
  • Decoder tokenizer: RichNachos/georgian-corpus-tokenizer-test
  • Interactive demo: Georgian Translation Space