Aramaic Targum Diacritization (Vocalization) MarianMT Model

This model is a fine-tuned version of Helsinki-NLP/opus-mt-afa-afa (MarianMT) for Aramaic diacritization: adding nikkud (vowel points) to consonantal Aramaic Targum text. It was trained on a parallel corpus of consonantal and fully vocalized Aramaic, both in Hebrew script.

Model Details

  • Model Name: johnlockejrr/opus-arc-targum-vocalization
  • Base Model: Helsinki-NLP/opus-mt-afa-afa
  • Task: Aramaic diacritization (consonantal → vocalized)
  • Script: Hebrew script (consonantal and vocalized)
  • Domain: Targumic/Biblical Aramaic
  • Model Size: ~61.4M parameters (F32, safetensors)
  • License: MIT

Dataset

  • Source: Consonantal Aramaic Targum text (no nikkud)
  • Target: Fully vocalized Aramaic Targum text (with nikkud)
  • Format: CSV with columns consonantal (input) and vocalized (target)
  • Alignment: Verse-aligned or phrase-aligned
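
The corpus itself is not published in this README; the sketch below shows one way to load and tokenize a CSV with this layout, assuming a local file named targum_pairs.csv (an illustrative name) and the Hugging Face datasets library.

from datasets import load_dataset
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Helsinki-NLP/opus-mt-afa-afa")
ds = load_dataset("csv", data_files="targum_pairs.csv")["train"]

def preprocess(batch):
    # Consonantal text is the source; the fully pointed text is the target.
    model_inputs = tokenizer(batch["consonantal"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["vocalized"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = ds.map(preprocess, batched=True, remove_columns=ds.column_names)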

Training Configuration

  • Base Model: Helsinki-NLP/opus-mt-afa-afa
  • Batch Size: 8 (per device, gradient accumulation as needed)
  • Learning Rate: 1e-5
  • Epochs: 100 (typical)
  • FP16: Enabled
  • Language Prefix: None (single language, Aramaic)
  • Tokenizer: MarianMT default
  • Max Input/Target Length: 512
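
A minimal training sketch consistent with the settings above; it reuses the tokenized dataset from the Dataset section sketch, and the output directory name is illustrative rather than part of the original run.

from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "Helsinki-NLP/opus-mt-afa-afa"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

args = Seq2SeqTrainingArguments(
    output_dir="opus-arc-targum-vocalization",  # illustrative name
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=100,
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized,  # from the Dataset section sketch
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()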

Usage

Inference Example

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/opus-arc-targum-vocalization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Diacritize consonantal Aramaic
consonantal = "ื‘ืงื“ืžื™ืŸ ื‘ืจื ื™ื™ ื™ืช ืฉืžื™ื ื•ื™ืช ืืจืขื"
inputs = tokenizer(consonantal, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
vocalized = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Vocalized: {vocalized}")

Intended Use

  • Primary: Automatic diacritization (vocalization) of Aramaic Targum text
  • Research: Useful for digital humanities, Semitic linguistics, and textual studies
  • Education: Can assist in language learning and textual analysis

Limitations

  • Context: The model is trained at the phrase/verse level and does not have document-level context
  • Domain: Optimized for Targumic/Biblical Aramaic; may not generalize to other dialects
  • Orthography: Input must be consonantal Aramaic in Hebrew script
  • Ambiguity: Some words admit multiple valid vocalizations; the model returns only its highest-scoring candidate (see the sketch below for retrieving alternatives)
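
On the ambiguity point, beam search can surface more than one candidate vocalization. A sketch reusing model, tokenizer, and inputs from the inference example above; the beam counts are illustrative:

outputs = model.generate(
    **inputs,
    max_length=512,
    num_beams=8,              # wider beam than the default example
    num_return_sequences=4,   # return the top 4 candidates
)
for i, cand in enumerate(tokenizer.batch_decode(outputs, skip_special_tokens=True)):
    print(f"{i + 1}. {cand}")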

Citation

If you use this model, please cite:

@misc{opus-arc-targum-vocalization,
  author = {John Locke Jr.},
  title = {Aramaic Targum Diacritization (Vocalization) MarianMT Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/opus-arc-targum-vocalization}},
}

Acknowledgements

  • Targumic Aramaic sources: Public domain or open-access editions
  • Helsinki-NLP: For the base MarianMT model

License

MIT
