MarianMT ONNX Model for Samaritan Hebrew ↔ Samaritan Aramaic Translation
Model Description
This is the ONNX (Open Neural Network Exchange) format version of the bidirectional translation model fine-tuned from Helsinki-NLP/opus-mt-sem-sem for translating between Samaritan Hebrew (smp) and Samaritan Aramaic (sam). The model supports both translation directions using special language tags (>>smp<< and >>sam<<).
This ONNX conversion enables:
- Faster inference with optimized ONNX Runtime
- Cross-platform deployment (CPU, GPU, mobile, edge devices)
- Lower memory footprint compared to PyTorch models
- Production-ready inference in various environments
For the original PyTorch model, see: johnlockejrr/marianmt-smp-sam
Model Details
- Model Type: Seq2Seq (Marian) - ONNX Format
- Base Model: Helsinki-NLP/opus-mt-sem-sem
- Languages: Samaritan Hebrew (smp) ↔ Samaritan Aramaic (sam)
- Direction: Bidirectional
- Vocabulary Size: 33,702 tokens (2 additional special tokens: >>smp<< and >>sam<<; see the quick check after this list)
- Model Parameters: 61,918,208
- Input/Output Max Length: 313 tokens
- Format: ONNX (Open Neural Network Exchange)
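To confirm that the language tags are real vocabulary entries rather than being split or mapped to the unknown token, a quick check like the following can be run (a sketch; the expected values come from the details above):
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
print(len(tokenizer))  # expected vocabulary size per the details above: 33,702
for tag in (">>smp<<", ">>sam<<"):
    # each tag should map to its own id, not to tokenizer.unk_token_id
    print(tag, tokenizer.convert_tokens_to_ids(tag))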
Training Details
Training Configuration
- Training Epochs: 96.35 (of 100 planned, early stopping at step 29,000)
- Batch Size: 16 per device
- Effective Batch Size: 32 (with gradient accumulation)
- Learning Rate: 1e-5
- Warmup Steps: 1,000
- Weight Decay: 0.01
- Gradient Accumulation Steps: 2
- Optimization: AdamW with cosine learning rate schedule with restarts
- Precision: bfloat16 (BF16)
- Training Time: ~47.8 minutes (2,866 seconds)
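For reference, the configuration above roughly corresponds to the following Seq2SeqTrainingArguments. This is a reconstruction from the listed hyperparameters, not the author's actual training script, and the output directory is illustrative:
from transformers import Seq2SeqTrainingArguments
training_args = Seq2SeqTrainingArguments(
    output_dir="marianmt-smp-sam",           # illustrative path
    num_train_epochs=100,                     # training stopped early at ~96.35 epochs
    per_device_train_batch_size=16,
    gradient_accumulation_steps=2,            # effective batch size of 32
    learning_rate=1e-5,
    warmup_steps=1000,
    weight_decay=0.01,
    lr_scheduler_type="cosine_with_restarts", # AdamW is the default optimizer
    bf16=True,
)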
Dataset
- Train Split: 9,610 sentence pairs (4,805 original bidirectional pairs)
- Validation Split: 1,080 sentence pairs (540 original pairs)
- Test Split: 108 sentence pairs (54 original pairs)
- Total Dataset: 10,798 bidirectional sentence pairs from biblical parallel texts
- Format: Pipe-delimited CSV with columns: Book|Chapter|Verse|Samaritan|Aramaic
- Script: Hebrew script for both languages
The dataset contains parallel biblical texts in Samaritan Hebrew and Samaritan Aramaic (Targumic), with both directions included in the training data to enable bidirectional translation.
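A minimal sketch of how such a pipe-delimited corpus can be turned into tag-prefixed bidirectional pairs is shown below. The file name is illustrative, a header row is assumed, and this is not the author's preprocessing script:
import csv
pairs = []
with open("samaritan_parallel.csv", encoding="utf-8") as f:
    reader = csv.DictReader(f, delimiter="|")  # columns: Book|Chapter|Verse|Samaritan|Aramaic
    for row in reader:
        hebrew, aramaic = row["Samaritan"], row["Aramaic"]
        # Samaritan Hebrew -> Samaritan Aramaic direction
        pairs.append({"source": f">>smp<< {hebrew}", "target": aramaic})
        # Samaritan Aramaic -> Samaritan Hebrew direction
        pairs.append({"source": f">>sam<< {aramaic}", "target": hebrew})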
Training Process
Training was conducted with:
- Early stopping patience: 5 evaluation steps
- Evaluation every 500 steps
- Best model checkpoint: checkpoint-26500 (BLEU: 60.48)
- Final checkpoint: checkpoint-29000 (BLEU: 59.72 after 96.35 epochs)
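The early-stopping behaviour described above corresponds to the standard Transformers callback; a small sketch (the surrounding Seq2SeqTrainer construction is assumed, not shown here):
from transformers import EarlyStoppingCallback
# Stop when the validation metric fails to improve for 5 consecutive evaluations;
# passed to Seq2SeqTrainer via callbacks=[early_stopping], with evaluation every 500 steps
# and load_best_model_at_end=True in the training arguments.
early_stopping = EarlyStoppingCallback(early_stopping_patience=5)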
Note: The ONNX model was converted from the trained PyTorch model and preserves the same performance metrics.
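Such a conversion can be reproduced with Optimum's Python API, for example (a sketch; not necessarily the exact command used for this repository):
from optimum.onnxruntime import ORTModelForSeq2SeqLM
# Export the PyTorch checkpoint to ONNX and save the resulting files locally
onnx_model = ORTModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam", export=True)
onnx_model.save_pretrained("marianmt-smp-sam-onnx")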
Performance
Evaluation Metrics (Test Set)
- BLEU Score: 59.72 (best: 60.48 at checkpoint-26500)
- chrF Score: 77.91
- Character Accuracy: 51.09%
These metrics are identical to the original PyTorch model, as the ONNX conversion is lossless.
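BLEU and chrF can be recomputed from model outputs with sacrebleu; the snippet below is a sketch with placeholder strings, not the exact evaluation script used for this card:
import sacrebleu
hypotheses = ["..."]     # model translations for the test split
references = [["..."]]   # one reference stream, aligned with the hypotheses
bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.2f}, chrF: {chrf.score:.2f}")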
Training Metrics
- Final Training Loss: 0.722
- Final Evaluation Loss: 0.825
- Best BLEU (validation): 60.48 at step 26,500
Installation
pip install onnxruntime transformers
For GPU acceleration (optional):
pip install onnxruntime-gpu
The Optimum-based examples below additionally require:
pip install optimum[onnxruntime]
Usage
Inference with ONNX Runtime
import numpy as np
from transformers import AutoTokenizer
import onnxruntime as ort
# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
# A seq2seq ONNX export consists of an encoder session and a decoder session plus a decoding loop.
# The file and input names below assume an Optimum-style export (encoder_model.onnx / decoder_model.onnx);
# adjust them to the files actually shipped in this repository.
encoder = ort.InferenceSession("encoder_model.onnx")
decoder = ort.InferenceSession("decoder_model.onnx")
# Translate from Samaritan Hebrew to Samaritan Aramaic
text_smp = "ืืืจ ืืืืจืื ืืืื ืืื ืืืจ ืืืื ืื ืืืจื"
input_text = f">>smp<< {text_smp}"
inputs = tokenizer(input_text, return_tensors="np", max_length=313, truncation=True)
input_ids = inputs["input_ids"].astype(np.int64)
attention_mask = inputs["attention_mask"].astype(np.int64)
# Run the encoder once
encoder_hidden = encoder.run(None, {"input_ids": input_ids, "attention_mask": attention_mask})[0]
# Greedy decoding loop (Marian uses the pad token as the decoder start token)
decoder_ids = np.array([[tokenizer.pad_token_id]], dtype=np.int64)
for _ in range(313):
    logits = decoder.run(None, {
        "input_ids": decoder_ids,
        "encoder_hidden_states": encoder_hidden,
        "encoder_attention_mask": attention_mask,
    })[0]
    next_id = logits[:, -1, :].argmax(axis=-1).reshape(1, 1).astype(np.int64)
    decoder_ids = np.concatenate([decoder_ids, next_id], axis=1)
    if next_id.item() == tokenizer.eos_token_id:
        break
# Decode the output, dropping the decoder start token
translation = tokenizer.decode(decoder_ids[0, 1:], skip_special_tokens=True)
print(translation)
Using Optimum for Seamless ONNX Inference
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer
# Load model and tokenizer
model = ORTModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
# Translate from Samaritan Hebrew to Samaritan Aramaic
text_smp = "ืืืจ ืืืืจืื ืืืื ืืื ืืืจ ืืืื ืื ืืืจื"
input_text = f">>smp<< {text_smp}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=313, truncation=True)
# Generate translation
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
# Translate from Samaritan Aramaic to Samaritan Hebrew
text_sam = "ืืชืจ ืืืืืื ืืืื ืืื ืืื ืืืื ืขื ืืืจื"
input_text = f">>sam<< {text_sam}"
inputs = tokenizer(input_text, return_tensors="pt", max_length=313, truncation=True)
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translation = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(translation)
Batch Inference
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer
model = ORTModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
texts = [
">>smp<< ืืืจ ืืืืจืื ืืืื ืืื ืืืจ ืืืื ืื ืืืจื",
">>sam<< ืืชืจ ืืืืืื ืืืื ืืื ืืื ืืืื ืขื ืืืจื"
]
inputs = tokenizer(texts, return_tensors="pt", padding=True, max_length=313, truncation=True)
outputs = model.generate(**inputs, max_length=313, num_beams=4, length_penalty=0.6)
translations = tokenizer.batch_decode(outputs, skip_special_tokens=True)
for translation in translations:
    print(translation)
Language Tags
The model uses special language tags to indicate translation direction:
- >>smp<< - Prefix for Samaritan Hebrew (source) → Samaritan Aramaic (target)
- >>sam<< - Prefix for Samaritan Aramaic (source) → Samaritan Hebrew (target)
These tags must be included at the beginning of the input text for proper direction control.
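A small convenience wrapper (hypothetical, not part of the repository) makes the direction choice explicit:
from optimum.onnxruntime import ORTModelForSeq2SeqLM
from transformers import AutoTokenizer
model = ORTModelForSeq2SeqLM.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
tokenizer = AutoTokenizer.from_pretrained("johnlockejrr/marianmt-smp-sam-onnx")
def translate(text: str, direction: str) -> str:
    """direction: 'smp2sam' for Hebrew -> Aramaic, 'sam2smp' for Aramaic -> Hebrew."""
    tag = ">>smp<<" if direction == "smp2sam" else ">>sam<<"
    inputs = tokenizer(f"{tag} {text}", return_tensors="pt", max_length=313, truncation=True)
    outputs = model.generate(**inputs, max_length=313, num_beams=4)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)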
ONNX-Specific Advantages
- Performance: ONNX Runtime provides optimized inference, often faster than PyTorch for production workloads
- Portability: Run on various platforms (Windows, Linux, macOS, Android, iOS, Web)
- Hardware Support: Optimized for different hardware (CPU, GPU, NPU, TPU); see the provider example after this list
- Memory Efficiency: Lower memory footprint compared to PyTorch models
- Production Ready: Better suited for deployment in production environments
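For example, a specific ONNX Runtime execution provider can be selected when loading the model with Optimum (the CUDA provider requires onnxruntime-gpu); this is a sketch using standard provider names:
from optimum.onnxruntime import ORTModelForSeq2SeqLM
model = ORTModelForSeq2SeqLM.from_pretrained(
    "johnlockejrr/marianmt-smp-sam-onnx",
    provider="CUDAExecutionProvider",  # use "CPUExecutionProvider" on CPU-only machines
)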
Limitations and Considerations
- Domain Specificity: The model was trained primarily on biblical texts and may perform better on similar religious or historical texts.
- Script Normalization: Input texts may need normalization (removal of diacritics/niqqud) depending on your use case; a sketch follows this list.
- Length Constraints: Maximum sequence length is 313 tokens. Longer texts will be truncated.
- Character Accuracy: At 51.09%, character-level accuracy indicates room for improvement, though BLEU and chrF scores suggest reasonable translation quality.
- ONNX Limitations: Some advanced generation features (like sampling with temperature) may have limited support compared to PyTorch. Beam search and greedy decoding are fully supported.
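One possible normalization, sketched below, strips niqqud and cantillation marks by dropping Unicode combining marks; whether this is appropriate depends on how your input text was produced:
import unicodedata
def strip_points(text: str) -> str:
    # Decompose, then drop combining marks (niqqud and cantillation); base letters are kept.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
print(strip_points("בְּרֵאשִׁית"))  # -> בראשית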
Citation
If you use this model, please cite:
@misc{marianmt-smp-sam-onnx,
title={MarianMT ONNX Model for Samaritan Hebrew ↔ Samaritan Aramaic Translation},
author={johnlockejrr},
year={2025},
howpublished={\url{https://huggingface.co/johnlockejrr/marianmt-smp-sam-onnx}}
}
Related Models
- PyTorch Version: johnlockejrr/marianmt-smp-sam - Original PyTorch model for research and development
Acknowledgments
- Base model: Helsinki-NLP/opus-mt-sem-sem
- Training framework: Hugging Face Transformers
- ONNX conversion: Optimum / ONNX Runtime
- Dataset: Parallel biblical texts in Samaritan Hebrew and Samaritan Aramaic
Model Card Contact
For questions, issues, or contributions, please refer to the model repository at https://huggingface.co/johnlockejrr/marianmt-smp-sam-onnx.