MarianMT ONNX Model for Samaritan Hebrew ↔ Samaritan Aramaic Translation
Model Description
This is the ONNX (Open Neural Network Exchange) format version of the bidirectional translation model fine-tuned from Helsinki-NLP/opus-mt-sem-sem for translating between Samaritan Hebrew (smp) and Samaritan Aramaic (sam). The model supports both translation directions using special language tags (>>smp<< and >>sam<<).
This ONNX conversion enables:
- Faster inference with optimized ONNX Runtime
- Cross-platform deployment (CPU, GPU, mobile, edge devices)
- Lower memory footprint compared to PyTorch models
- Production-ready inference in various environments
For the original PyTorch model, see: johnlockejrr/marianmt-smp-sam
Model Details
- Model Type: Seq2Seq (Marian) - ONNX Format
- Base Model: Helsinki-NLP/opus-mt-sem-sem
- Languages: Samaritan Hebrew (smp) ↔ Samaritan Aramaic (sam)
- Direction: Bidirectional
- Vocabulary Size: 33,702 tokens (2 additional special tokens: >>smp<< and >>sam<<; see the quick check after this list)
- Model Parameters: 61,918,208
- Input/Output Max Length: 313 tokens
- Format: ONNX (Open Neural Network Exchange)
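To confirm that the language tags are real vocabulary entries rather than being split or mapped to the unknown token, a quick check like the following can be run (a sketch; the expected values come from the details above):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
print(len(tokenizer))  # expected vocabulary size per the details above: 33,702
for tag in (">>smp<<", ">>sam<<"):
    # each tag should map to its own id, not to tokenizer.unk_token_id
    print(tag, tokenizer.convert_tokens_to_ids(tag))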
Training Details
Training Configuration
- Training Epochs: 96.35 (of 100 planned, early stopping at step 29,000)
- Batch Size: 16 per device
- Effective Batch Size: 32 (with gradient accumulation)
- Learning Rate: 1e-5
- Warmup Steps: 1,000
- Weight Decay: 0.01
- Gradient Accumulation Steps: 2
- Optimization: AdamW with cosine learning rate schedule with restarts
- Precision: bfloat16 (BF16)
- Training Time: ~47.8 minutes (2,866 seconds)
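For reference, the configuration above roughly corresponds to the following Seq2SeqTrainingArguments. This is a reconstruction from the listed hyperparameters, not the author's actual training script, and the output directory is illustrative:
from transformers import Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir="marianmt-smp-sam",           # illustrative path
    num_train_epochs=100,                     # training stopped early at ~96.35 epochs
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,            # effective batch size of 32
    learning_rate=1e-5,
    warmup_steps=1000,
    weight_decay=0.01,
    lr_scheduler_type="cosine_with_restarts", # AdamW is the default optimizer
    bf16=True,
)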
Dataset
- Train Split: 9,610 sentence pairs (4,805 original bidirectional pairs)
- Validation Split: 1,080 sentence pairs (540 original pairs)
- Test Split: 108 sentence pairs (54 original pairs)
- Total Dataset: 10,798 bidirectional sentence pairs from biblical parallel texts
- Format: Pipe-delimited CSV with columns: Book|Chapter|Verse|Samaritan|Aramaic
- Script: Hebrew script for both languages
The dataset contains parallel biblical texts in Samaritan Hebrew and Samaritan Aramaic (Targumic), with both directions included in the training data to enable bidirectional translation.
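A minimal sketch of how such a pipe-delimited corpus can be turned into tag-prefixed bidirectional pairs is shown below. The file name is illustrative, a header row is assumed, and this is not the author's preprocessing script:
import csv
pairs = []
with open("samaritan_parallel.csv", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="|")  # columns: Book|Chapter|Verse|Samaritan|Aramaic
    for row in reader:
        hebrew, aramaic = row["Samaritan"], row["Aramaic"]
        # Samaritan Hebrew -> Samaritan Aramaic direction
        pairs.append({"source": f">>smp<< {hebrew}", "target": aramaic})
        # Samaritan Aramaic -> Samaritan Hebrew direction
        pairs.append({"source": f">>sam<< {aramaic}", "target": hebrew})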
Training Process
Training was conducted with:
- Early stopping patience: 5 evaluation steps
- Evaluation every 500 steps
- Best model checkpoint: checkpoint-26500 (BLEU: 60.48)
- Final checkpoint: checkpoint-29000 (BLEU: 59.72 after 96.35 epochs)
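The early-stopping behaviour described above corresponds to the standard Transformers callback; a small sketch (the surrounding Seq2SeqTrainer construction is assumed, not shown here):
from transformers import EarlyStoppingCallback
# Stop when the validation metric fails to improve for 5 consecutive evaluations;
# passed to Seq2SeqTrainer via callbacks=[early_stopping], with evaluation every 500 steps
# and load_best_model_at_end=True in the training arguments.
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)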
Note: The ONNX model was converted from the trained PyTorch model and preserves the same performance metrics.
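Such a conversion can be reproduced with Optimum's Python API, for example (a sketch; not necessarily the exact command used for this repository):
from optimum.onnxruntime import ORTModelForSeq2SeqLM
# Export the PyTorch checkpoint to ONNX and save the resulting files locally
onnx_model = ORTModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam", export=True)
onnx_model.save_pretrained("marianmt-smp-sam-onnx")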
Performance
Evaluation Metrics (Test Set)
- BLEU Score: 59.72 (best: 60.48 at checkpoint-26500)
- chrF Score: 77.91
- Character Accuracy: 51.09%
These metrics are identical to the original PyTorch model, as the ONNX conversion is lossless.
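BLEU and chrF can be recomputed from model outputs with sacrebleu; the snippet below is a sketch with placeholder strings, not the exact evaluation script used for this card:
import sacrebleu
hypotheses = ["..."]     # model translations for the test split
references = [["..."]]   # one reference stream, aligned with the hypotheses
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}, chrF: {chrf.score:.2f}")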
Training Metrics
- Final Training Loss: 0.722
- Final Evaluation Loss: 0.825
- Best BLEU (validation): 60.48 at step 26,500
Installation
pip install onnxruntime transformers
For GPU acceleration (optional):
pip install onnxruntime-gpu
The Optimum-based examples below additionally require:
pip install optimum[onnxruntime]
Usage
Inference with ONNX Runtime
import numpy as np
from transformers import AutoTokenizer
import onnxruntime as ort
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
# A seq2seq ONNX export consists of an encoder session and a decoder session plus a decoding loop.
# The file and input names below assume an Optimum-style export (encoder_model.onnx / decoder_model.onnx);
# adjust them to the files actually shipped in this repository.
encoder = ort.InferenceSession("encoder_model.onnx")
decoder = ort.InferenceSession("decoder_model.onnx")
# Translate from Samaritan Hebrew to Samaritan Aramaic
text_smp = "ืืืจ ืืืืจืื ืืืื ืืื ืืืจ ืืืื ืื ืืืจื"
input_text = f">>smp<< {text_smp}"
inputs = tokenizer(input_text, return_tensors="np", max_length=313, truncation=True)
input_ids = inputs["input_ids"].astype(np.int64)
attention_mask = inputs["attention_mask"].astype(np.int64)
# Run the encoder once
encoder_hidden = encoder.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})[0]
# Greedy decoding loop (Marian uses the pad token as the decoder start token)
decoder_ids = np.array([[tokenizer.pad_token_id]], dtype=np.int64)
for _ in range(313):
    logits = decoder.run(None, {
        "input_ids": decoder_ids,
        "encoder_hidden_states": encoder_hidden,
        "encoder_attention_mask": attention_mask,
    })[0]
    next_id = logits[:, -1, :].argmax(axis=-1).reshape(1, 1).astype(np.int64)
    decoder_ids = np.concatenate([decoder_ids, next_id], axis=1)
    if next_id.item() == tokenizer.eos_token_id:
        break
# Decode the output, dropping the decoder start token
translation = tokenizer.decode(decoder_ids[0, 1:], skip_special_tokens=True)
print(translation)
Using Optimum for Seamless ONNX Inference
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer
# Load model and tokenizer
model = ORTModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
# Translate from Samaritan Hebrew to Samaritan Aramaic
text_smp = "ืืืจ ืืืืจืื ืืืื ืืื ืืืจ ืืืื ืื ืืืจื"
input_text = f">>smp<< {text_smp}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=313, truncation=True)
# Generate translation
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
# Translate from Samaritan Aramaic to Samaritan Hebrew
text_sam = "ืืชืจ ืืืืืื ืืืื ืืื ืืื ืืืื ืขื ืืืจื"
input_text = f">>sam<< {text_sam}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=313, truncation=True)
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
Batch Inference
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer
model = ORTModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
texts = [
">>smp<< ืืืจ ืืืืจืื ืืืื ืืื ืืืจ ืืืื ืื ืืืจื",
">>sam<< ืืชืจ ืืืืืื ืืืื ืืื ืืื ืืืื ืขื ืืืจื"
]
inputs = tokenizer(texts, return_tensors="pt", padding=True, max_length=313, truncation=True)
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for translation in translations:
    print(translation)
Language Tags
The model uses special language tags to indicate translation direction:
- >>smp<< - Prefix for Samaritan Hebrew (source) → Samaritan Aramaic (target)
- >>sam<< - Prefix for Samaritan Aramaic (source) → Samaritan Hebrew (target)
These tags must be included at the beginning of the input text for proper direction control.
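A small convenience wrapper (hypothetical, not part of the repository) makes the direction choice explicit:
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer
model = ORTModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
def translate(text: str, direction: str) -> str:
    """direction: 'smp2sam' for Hebrew -> Aramaic, 'sam2smp' for Aramaic -> Hebrew."""
    tag = ">>smp<<" if direction == "smp2sam" else ">>sam<<"
    inputs = tokenizer(f"{tag} {text}", return_tensors="pt", max_length=313, truncation=True)
    outputs = model.generate(**inputs, max_length=313, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)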
ONNX-Specific Advantages
- Performance: ONNX Runtime provides optimized inference, often faster than PyTorch for production workloads
- Portability: Run on various platforms (Windows, Linux, macOS, Android, iOS, Web)
- Hardware Support: Optimized for different hardware (CPU, GPU, NPU, TPU); see the provider example after this list
- Memory Efficiency: Lower memory footprint compared to PyTorch models
- Production Ready: Better suited for deployment in production environments
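For example, a specific ONNX Runtime execution provider can be selected when loading the model with Optimum (the CUDA provider requires onnxruntime-gpu); this is a sketch using standard provider names:
from optimum.onnxruntime import ORTModelForSeq2SeqLM
model = ORTModelForSeq2SeqLM.from_pretrained(
    "johnlockejrr/marianmt-smp-sam-onnx",
    provider="CUDAExecutionProvider",  # use "CPUExecutionProvider" on CPU-only machines
)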
Limitations and Considerations
- Domain Specificity: The model was trained primarily on biblical texts and may perform better on similar religious or historical texts.
- Script Normalization: Input texts may need normalization (removal of diacritics/niqqud) depending on your use case; a sketch follows this list.
- Length Constraints: Maximum sequence length is 313 tokens. Longer texts will be truncated.
- Character Accuracy: At 51.09%, character-level accuracy indicates room for improvement, though BLEU and chrF scores suggest reasonable translation quality.
- ONNX Limitations: Some advanced generation features (like sampling with temperature) may have limited support compared to PyTorch. Beam search and greedy decoding are fully supported.
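One possible normalization, sketched below, strips niqqud and cantillation marks by dropping Unicode combining marks; whether this is appropriate depends on how your input text was produced:
import unicodedata
def strip_points(text: str) -> str:
    # Decompose, then drop combining marks (niqqud and cantillation); base letters are kept.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(strip_points("בְּרֵאשִׁית"))  # -> בראשית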
Citation
If you use this model, please cite:
@misc{marianmt-smp-sam-onnx,
title={MarianMT ONNX Model for Samaritan Hebrew ↔ Samaritan Aramaic Translation},
author={johnlockejrr},
year={2025},
howpublished={\url{https://huggingface.co/johnlockejrr/marianmt-smp-sam-onnx}}
}
Related Models
- PyTorch Version: johnlockejrr/marianmt-smp-sam - Original PyTorch model for research and development
Acknowledgments
- Base model: Helsinki-NLP/opus-mt-sem-sem
- Training framework: Hugging Face Transformers
- ONNX conversion: Optimum / ONNX Runtime
- Dataset: Parallel biblical texts in Samaritan Hebrew and Samaritan Aramaic
Model Card Contact
For questions, issues, or contributions, please refer to the model repository at https://huggingface.co/johnlockejrr/marianmt-smp-sam-onnx.