---
language:
- arc
tags:
- diacritization
- aramaic
- vocalization
- targum
- semitic-languages
- sequence-to-sequence
license: mit
base_model: Helsinki-NLP/opus-mt-afa-afa
library_name: transformers
---
# Aramaic Targum Diacritization (Vocalization) MarianMT Model
This model is a fine-tuned version of the Helsinki-NLP/opus-mt-afa-afa MarianMT model for **Aramaic diacritization**: adding nikkud (vowel points) to consonantal Aramaic Targum text. It was trained on a parallel corpus of consonantal and fully vocalized Aramaic, both in Hebrew script.
## Model Details
- **Model Name**: `johnlockejrr/opus-arc-targum-vocalization`
- **Base Model**: [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- **Task**: Aramaic diacritization (consonantal → vocalized)
- **Script**: Hebrew script (consonantal and vocalized)
- **Domain**: Targumic/Biblical Aramaic
- **License**: MIT
## Dataset
- **Source**: Consonantal Aramaic Targum text (no nikkud)
- **Target**: Fully vocalized Aramaic Targum text (with nikkud)
- **Format**: CSV with columns `consonantal` (input) and `vocalized` (target); see the loading sketch after this list
- **Alignment**: Verse-aligned or phrase-aligned
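A minimal sketch of how such a corpus could be loaded for training. The file name `targum_vocalization.csv` is a placeholder; the dataset itself is not published with this card:
```python
# Minimal sketch: load the parallel corpus. The CSV file name is a
# placeholder, not a published artifact of this repository.
from datasets import load_dataset

dataset = load_dataset(
    "csv",
    data_files="targum_vocalization.csv",  # columns: consonantal, vocalized
    split="train",
)

# Hold out a small validation split for evaluation during training.
dataset = dataset.train_test_split(test_size=0.05, seed=42)
print(dataset["train"][0])  # {'consonantal': '...', 'vocalized': '...'}
```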
## Training Configuration
- **Base Model**: Helsinki-NLP/opus-mt-afa-afa
- **Batch Size**: 8 (per device, gradient accumulation as needed)
- **Learning Rate**: 1e-5
- **Epochs**: 100 (typical)
- **FP16**: Enabled
- **No language prefix** (single language, Aramaic)
- **Tokenizer**: MarianMT default
- **Max Input/Target Length**: 512 (a training sketch with these settings follows below)
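A minimal sketch of a training setup matching the configuration above, continuing from the loading sketch in the Dataset section. The preprocessing function and output directory name are illustrative assumptions, not the exact training script:
```python
# Sketch of a Seq2Seq training setup matching the configuration above.
# Names like `preprocess` and the output directory are assumptions.
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "Helsinki-NLP/opus-mt-afa-afa"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

def preprocess(batch):
    # Single language, so no language prefix is prepended to the source.
    model_inputs = tokenizer(batch["consonantal"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["vocalized"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=["consonantal", "vocalized"])

args = Seq2SeqTrainingArguments(
    output_dir="opus-arc-targum-vocalization",  # illustrative
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=100,
    fp16=True,
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```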
## Usage
### Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "johnlockejrr/opus-arc-targum-vocalization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Diacritize consonantal Aramaic
consonantal = "בקדמין ברא יי ית שמיא וית ארעא"  # Targum Onkelos, Genesis 1:1
inputs = tokenizer(consonantal, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
vocalized = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Vocalized: {vocalized}")
```
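For diacritizing several verses at once, the same model and tokenizer can process padded batches; a minimal sketch reusing the objects above (the second verse is an illustrative placeholder):
```python
# Sketch: batched diacritization, reusing `tokenizer` and `model` above.
verses = [
    "בקדמין ברא יי ית שמיא וית ארעא",
    "וארעא הות צדיא וריקניא",  # illustrative second verse
]
batch = tokenizer(verses, return_tensors="pt", padding=True,
                  max_length=512, truncation=True)
outputs = model.generate(**batch, max_length=512, num_beams=4)
for src, out in zip(verses, outputs):
    print(src, "->", tokenizer.decode(out, skip_special_tokens=True))
```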
## Intended Use
- **Primary**: Automatic diacritization (vocalization) of Aramaic Targum text
- **Research**: Useful for digital humanities, Semitic linguistics, and textual studies
- **Education**: Can assist in language learning and textual analysis
## Limitations
- **Context**: The model is trained at the phrase/verse level and does not have document-level context
- **Domain**: Optimized for Targumic/Biblical Aramaic; may not generalize to other dialects
- **Orthography**: Input must be consonantal Aramaic in Hebrew script
- **Ambiguity**: Some words admit multiple valid vocalizations; the model outputs only the most likely one (see the sketch below for retrieving alternatives)
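When alternatives matter, beam search can return several candidate vocalizations instead of only the best one; a sketch reusing the variables from the inference example above:
```python
# Sketch: return the top-3 beam hypotheses as alternative vocalizations,
# reusing `tokenizer`, `model`, and `consonantal` from the Usage example.
inputs = tokenizer(consonantal, return_tensors="pt",
                   max_length=512, truncation=True)
candidates = model.generate(
    **inputs,
    max_length=512,
    num_beams=4,
    num_return_sequences=3,  # must not exceed num_beams
)
for i, cand in enumerate(candidates, start=1):
    print(f"{i}. {tokenizer.decode(cand, skip_special_tokens=True)}")
```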
## Citation
If you use this model, please cite:
```bibtex
@misc{opus-arc-targum-vocalization,
  author       = {John Locke Jr.},
  title        = {Aramaic Targum Diacritization (Vocalization) MarianMT Model},
  year         = {2025},
  publisher    = {Hugging Face},
  journal      = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/opus-arc-targum-vocalization}},
}
```
## Acknowledgements
- **Targumic Aramaic sources**: Public domain or open-access editions
- **Helsinki-NLP**: For the base MarianMT model
## License
MIT