---
language:
- arc
tags:
- diacritization
- aramaic
- vocalization
- targum
- semitic-languages
- sequence-to-sequence
license: mit
base_model: Helsinki-NLP/opus-mt-afa-afa
library_name: transformers
---
# Aramaic Targum Diacritization (Vocalization) MarianMT Model
This model is [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa) fine-tuned for **Aramaic diacritization**: adding nikkud (vowel points) to consonantal Aramaic Targum text. It was trained on a parallel corpus of consonantal and fully vocalized Aramaic, both in Hebrew script.
## Model Details
- **Model Name**: `johnlockejrr/opus-arc-targum-vocalization`
- **Base Model**: [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa)
- **Task**: Aramaic diacritization (consonantal → vocalized)
- **Script**: Hebrew script (consonantal and vocalized)
- **Domain**: Targumic/Biblical Aramaic
- **License**: MIT
## Dataset
- **Source**: Consonantal Aramaic Targum text (no nikkud)
- **Target**: Fully vocalized Aramaic Targum text (with nikkud)
- **Format**: CSV with columns `consonantal` (input) and `vocalized` (target); a loading sketch follows this list
- **Alignment**: Verse-aligned or phrase-aligned
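The corpus itself is not bundled with this card. As a minimal sketch, assuming a local CSV with the two columns described above (the file name `targum_pairs.csv` is hypothetical), the data can be loaded with 🤗 Datasets:
```python
from datasets import load_dataset

# Hypothetical file name; substitute your own verse-aligned CSV
# with `consonantal` and `vocalized` columns.
dataset = load_dataset("csv", data_files="targum_pairs.csv")["train"]
dataset = dataset.train_test_split(test_size=0.1)

print(dataset["train"][0])
# {'consonantal': '...', 'vocalized': '...'}
```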
## Training Configuration
- **Base Model**: Helsinki-NLP/opus-mt-afa-afa
- **Batch Size**: 8 (per device, gradient accumulation as needed)
- **Learning Rate**: 1e-5
- **Epochs**: 100 (typical)
- **FP16**: Enabled
- **Language Prefix**: None (single language, Aramaic)
- **Tokenizer**: MarianMT default
- **Max Input/Target Length**: 512
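The original training script is not included with the card. The following is a minimal fine-tuning sketch with `Seq2SeqTrainer` reflecting the settings above; the CSV file name and output directory are assumptions, not part of the released configuration:
```python
from datasets import load_dataset
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

base = "Helsinki-NLP/opus-mt-afa-afa"
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForSeq2SeqLM.from_pretrained(base)

# Hypothetical file name; see the dataset sketch above.
dataset = load_dataset("csv", data_files="targum_pairs.csv")["train"]
dataset = dataset.train_test_split(test_size=0.1)

def preprocess(batch):
    # No language prefix: source and target are both Aramaic.
    model_inputs = tokenizer(batch["consonantal"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["vocalized"], max_length=512, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(
    preprocess, batched=True, remove_columns=["consonantal", "vocalized"]
)

args = Seq2SeqTrainingArguments(
    output_dir="opus-arc-targum-vocalization",
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=100,
    fp16=True,  # requires a CUDA device
    predict_with_generate=True,
)

trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["test"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```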
## Usage
### Inference Example
```python
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_name = "johnlockejrr/opus-arc-targum-vocalization"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Diacritize consonantal Aramaic
consonantal = "בקדמין ברא יי ית שמיא וית ארעא"
inputs = tokenizer(consonantal, return_tensors="pt", max_length=512, truncation=True)
outputs = model.generate(**inputs, max_length=512, num_beams=4)
vocalized = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(f"Vocalized: {vocalized}")
```
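In this example, `num_beams=4` trades some speed for more stable output; greedy decoding (`num_beams=1`) is faster but may yield less consistent vocalizations. Increase `max_length` for unusually long verses.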
## Intended Use
- **Primary**: Automatic diacritization (vocalization) of Aramaic Targum text
- **Research**: Useful for digital humanities, Semitic linguistics, and textual studies
- **Education**: Can assist in language learning and textual analysis
## Limitations
- **Context**: The model is trained at the phrase/verse level and does not have document-level context
- **Domain**: Optimized for Targumic/Biblical Aramaic; may not generalize to other dialects
- **Orthography**: Input must be consonantal Aramaic in Hebrew script
- **Ambiguity**: Some words admit multiple valid vocalizations; the model outputs only the most likely one (see the sketch after this list for retrieving alternatives)
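When ambiguity matters, one workaround is to return several beam hypotheses instead of only the top one. A minimal sketch, reusing `model`, `tokenizer`, and `inputs` from the inference example above (the `num_beams` and `num_return_sequences` values are illustrative):
```python
# Return several beam hypotheses so alternative vocalizations are visible.
outputs = model.generate(
    **inputs,
    max_length=512,
    num_beams=8,              # must be >= num_return_sequences
    num_return_sequences=4,
)
for rank, seq in enumerate(outputs, start=1):
    print(rank, tokenizer.decode(seq, skip_special_tokens=True))
```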
## Citation
If you use this model, please cite:
```bibtex
@misc{opus-arc-targum-vocalization,
  author       = {John Locke Jr.},
  title        = {Aramaic Targum Diacritization (Vocalization) MarianMT Model},
  year         = {2025},
  publisher    = {Hugging Face},
  note         = {Hugging Face model repository},
  howpublished = {\url{https://huggingface.co/johnlockejrr/opus-arc-targum-vocalization}},
}
```
## Acknowledgements
- **Targumic Aramaic sources**: Public domain or open-access editions
- **Helsinki-NLP**: For the base MarianMT model
## License
MIT