|
|
--- |
|
|
language: |
|
|
- arc |
|
|
tags: |
|
|
- diacritization |
|
|
- aramaic |
|
|
- vocalization |
|
|
- targum |
|
|
- semitic-languages |
|
|
- sequence-to-sequence |
|
|
license: mit |
|
|
base_model: Helsinki-NLP/opus-mt-afa-afa |
|
|
library_name: transformers |
|
|
--- |
|
|
|
|
|
# Aramaic Targum Diacritization (Vocalization) MarianMT Model |
|
|
|
|
|
This model is a fine-tuned version of Helsinki-NLP/opus-mt-afa-afa (MarianMT) for **Aramaic diacritization**: adding nikkud (vowel points) to consonantal Aramaic Targum text. It was trained on a parallel corpus of consonantal and fully vocalized Aramaic, both in Hebrew script. |
|
|
|
|
|
## Model Details |
|
|
|
|
|
- **Model Name**: `johnlockejrr/opus-arc-targum-vocalization` |
|
|
- **Base Model**: [Helsinki-NLP/opus-mt-afa-afa](https://huggingface.co/Helsinki-NLP/opus-mt-afa-afa) |
|
|
- **Task**: Aramaic diacritization (consonantal → vocalized) |
|
|
- **Script**: Hebrew script (consonantal and vocalized) |
|
|
- **Domain**: Targumic/Biblical Aramaic |
|
|
- **License**: MIT |
|
|
|
|
|
## Dataset |
|
|
|
|
|
- **Source**: Consonantal Aramaic Targum text (no nikkud) |
|
|
- **Target**: Fully vocalized Aramaic Targum text (with nikkud) |
|
|
- **Format**: CSV with columns `consonantal` (input) and `vocalized` (target) |
|
|
- **Alignment**: Verse-aligned or phrase-aligned |
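
Because nikkud are Unicode combining marks, the consonantal side of such a corpus can be derived mechanically from a vocalized text. A minimal sketch (the helper name is illustrative, not part of the released code):

```python
import unicodedata

def strip_nikkud(text: str) -> str:
    """Remove Hebrew vowel points (and other combining marks) from text."""
    # Nikkud (U+05B0-U+05C7) and cantillation marks are combining characters,
    # so unicodedata.combining() returns a nonzero class for them.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Base letters and non-combining punctuation such as maqaf are left untouched, so the output aligns character-for-character with the consonantal column.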
|
|
|
|
|
## Training Configuration |
|
|
|
|
|
- **Base Model**: Helsinki-NLP/opus-mt-afa-afa |
|
|
- **Batch Size**: 8 (per device, gradient accumulation as needed) |
|
|
- **Learning Rate**: 1e-5 |
|
|
- **Epochs**: 100 (typical) |
|
|
- **FP16**: Enabled |
|
|
- **No language prefix** (single language, Aramaic) |
|
|
- **Tokenizer**: MarianMT default |
|
|
- **Max Input/Target Length**: 512 |
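
Assuming a standard `transformers` `Seq2SeqTrainer` setup, the settings above correspond roughly to the following sketch (the output path is a placeholder; data loading and preprocessing are omitted):

```python
from transformers import (
    AutoModelForSeq2SeqLM,
    AutoTokenizer,
    Seq2SeqTrainingArguments,
)

model_name = "Helsinki-NLP/opus-mt-afa-afa"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

args = Seq2SeqTrainingArguments(
    output_dir="opus-arc-targum-vocalization",  # placeholder path
    per_device_train_batch_size=8,
    learning_rate=1e-5,
    num_train_epochs=100,
    fp16=True,
    predict_with_generate=True,
)
```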
|
|
|
|
|
## Usage |
|
|
|
|
|
### Inference Example |
|
|
|
|
|
```python |
|
|
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM |
|
|
|
|
|
model_name = "johnlockejrr/opus-arc-targum-vocalization" |
|
|
tokenizer = AutoTokenizer.from_pretrained(model_name) |
|
|
model = AutoModelForSeq2SeqLM.from_pretrained(model_name) |
|
|
|
|
|
# Diacritize consonantal Aramaic (example: Targum Onkelos, Genesis 1:1) |


consonantal = "בקדמין ברא יי ית שמיא וית ארעא" |
|
|
inputs = tokenizer(consonantal, return_tensors="pt", max_length=512, truncation=True) |
|
|
outputs = model.generate(**inputs, max_length=512, num_beams=4) |
|
|
vocalized = tokenizer.decode(outputs[0], skip_special_tokens=True) |
|
|
print(f"Vocalized: {vocalized}") |
|
|
``` |
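
To score outputs against a gold vocalized reference, one simple metric is the fraction of base letters whose attached vowel points match. A minimal sketch (not part of the released code, and assuming prediction and gold share the same consonantal skeleton):

```python
import unicodedata

def _groups(text):
    """Split text into units of one base character plus its combining marks."""
    text = unicodedata.normalize("NFC", text)
    units = []
    for ch in text:
        if unicodedata.combining(ch) and units:
            units[-1] += ch  # attach nikkud to the preceding base letter
        else:
            units.append(ch)
    return units

def diacritic_accuracy(predicted: str, gold: str) -> float:
    """Fraction of base characters whose attached marks match the gold text."""
    p, g = _groups(predicted), _groups(gold)
    if len(p) != len(g):
        raise ValueError("consonantal skeletons differ in length")
    return sum(a == b for a, b in zip(p, g)) / len(g)
```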
|
|
|
|
|
## Intended Use |
|
|
|
|
|
- **Primary**: Automatic diacritization (vocalization) of Aramaic Targum text |
|
|
- **Research**: Useful for digital humanities, Semitic linguistics, and textual studies |
|
|
- **Education**: Can assist in language learning and textual analysis |
|
|
|
|
|
## Limitations |
|
|
|
|
|
- **Context**: The model is trained at the phrase/verse level and does not have document-level context |
|
|
- **Domain**: Optimized for Targumic/Biblical Aramaic; may not generalize to other dialects |
|
|
- **Orthography**: Input must be consonantal Aramaic in Hebrew script |
|
|
- **Ambiguity**: Some words admit multiple valid vocalizations; the model outputs only the single most probable reading |
|
|
|
|
|
## Citation |
|
|
|
|
|
If you use this model, please cite: |
|
|
|
|
|
```bibtex |
|
|
@misc{opus-arc-targum-vocalization, |
|
|
author = {John Locke Jr.}, |
|
|
title = {Aramaic Targum Diacritization (Vocalization) MarianMT Model}, |
|
|
year = {2025}, |
|
|
publisher = {Hugging Face}, |
|
|
journal = {Hugging Face model repository}, |
|
|
howpublished = {\url{https://huggingface.co/johnlockejrr/opus-arc-targum-vocalization}}, |
|
|
} |
|
|
``` |
|
|
|
|
|
## Acknowledgements |
|
|
|
|
|
- **Targumic Aramaic sources**: Public domain or open-access editions |
|
|
- **Helsinki-NLP**: For the base MarianMT model |
|
|
|
|
|
## License |
|
|
|
|
|
MIT |