---
language:
- arc
tags:
- diacritization
- aramaic
- vocalization
- targum
- semitic-languages
- sequence-to-sequence
license: mit
base_model: Helsinki-NLP/opus-mt-afa-afa
library_name: transformers
---

# Aramaic Targum Diacritization (Vocalization) MarianMT Model

This model fine-tunes the Helsinki-NLP/opus-mt-afa-afa MarianMT model for **Aramaic diacritization**: adding nikkud (vowel points) to consonantal Aramaic Targum text. It is trained on a parallel corpus of consonantal and fully vocalized Aramaic, both in Hebrew script.

## Model Details

- **Model Name**: `johnlockejrr/opus-arc-targum-vocalization`
- **Base Model**: [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- **Task**: Aramaic diacritization (consonantal → vocalized)
- **Script**: Hebrew script (consonantal and vocalized)
- **Domain**: Targumic/Biblical Aramaic
- **License**: MIT

## Dataset

- **Source**: Consonantal Aramaic Targum text (no nikkud)
- **Target**: Fully vocalized Aramaic Targum text (with nikkud)
- **Format**: CSV with columns `consonantal` (input) and `vocalized` (target)
- **Alignment**: Verse-aligned or phrase-aligned

## Training Configuration

- **Base Model**: Helsinki-NLP/opus-mt-afa-afa
- **Batch Size**: 8 (per device, with gradient accumulation as needed)
- **Learning Rate**: 1e-5
- **Epochs**: 100 (typical)
- **FP16**: Enabled
- **Language Prefix**: None (single language, Aramaic)
- **Tokenizer**: MarianMT default
- **Max Input/Target Length**: 512

## Usage

### Inference Example

```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "johnlockejrr/opus-arc-targum-vocalization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Diacritize consonantal Aramaic
consonantal = "בקדמין ברא יי ית שמיא וית ארעא"
inputs = tokenizer(consonantal, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
vocalized = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Vocalized: {vocalized}")
```

## Intended Use

- **Primary**: Automatic diacritization (vocalization) of Aramaic Targum text
- **Research**: Useful for digital humanities, Semitic linguistics, and textual studies
- **Education**: Can assist in language learning and textual analysis

## Limitations

- **Context**: The model is trained at the phrase/verse level and has no document-level context
- **Domain**: Optimized for Targumic/Biblical Aramaic; may not generalize to other dialects
- **Orthography**: Input must be consonantal Aramaic in Hebrew script
- **Ambiguity**: Some words have multiple valid vocalizations; the model predicts the most likely one

## Citation

If you use this model, please cite:

```bibtex
@misc{opus-arc-targum-vocalization,
  author = {John Locke Jr.},
  title = {Aramaic Targum Diacritization (Vocalization) MarianMT Model},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/opus-arc-targum-vocalization}},
}
```

## Acknowledgements

- **Targumic Aramaic sources**: Public domain or open-access editions
- **Helsinki-NLP**: For the base MarianMT model

## License

MIT
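
## Appendix: Building Consonantal/Vocalized Pairs

Since nikkud and cantillation marks are Unicode combining characters, a consonantal input can be derived from a vocalized edition by stripping combining marks. This is one plausible way to build the parallel CSV described under Dataset; the `strip_nikkud` helper below is an illustrative sketch, not part of the model repository.

```python
import unicodedata

def strip_nikkud(text: str) -> str:
    """Remove Hebrew vowel points and cantillation, keeping base letters.

    NFD normalization first splits any precomposed letter+point forms so
    that every mark becomes a separate combining character, which the
    filter then drops (combining class 0 means a non-combining character).
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

# Example: vocalized "bara" (created) -> consonantal form
vocalized = "\u05D1\u05B0\u05BC\u05E8\u05B8\u05D0"   # בְּרָא
consonantal = strip_nikkud(vocalized)                 # ברא
print(consonantal)
```

Letters, spaces, and non-combining punctuation such as maqaf pass through unchanged, so verse alignment between the two columns is preserved line for line.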