billingsmoore/tibetan-phonetic-transliteration

+---
+license: cc-by-nc-4.0
+language:
+- bo
+base_model: google-t5/t5-small
+tags:
+- nlp
+- transliteration
+- tibetan
+- buddhism
+---
+# Model Card for tibetan-phonetic-transliteration
+This model is a text2text generation model for phonetic transliteration of Tibetan script.
+## Model Details
+### Model Description
+- **Developed by:** billingsmoore
+- **Model type:** text2text generation
+- **Language(s) (NLP):** Tibetan
+- **License:** [Attribution-NonCommercial 4.0 International (CC BY-NC 4.0)](https://creativecommons.org/licenses/by-nc/4.0/)
+- **Finetuned from model:** ['google-t5/t5-small'](https://huggingface.co/google-t5/t5-small)
+### Model Sources
+- **Repository:** [https://github.com/billingsmoore/MLotsawa](https://github.com/billingsmoore/MLotsawa)
+## Uses
+The intended use of this model is to provide phonetic transliteration of Tibetan script, typically as part of a larger Tibetan translation ecosystem.
+### Direct Use
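+To use the model for transliteration in a Python script, load it with the `pipeline` function from the `transformers` library:
+```python
+from transformers import pipeline
+# Load the model into a translation pipeline
+transliterator = pipeline('translation', model='billingsmoore/tibetan-phonetic-transliteration')
+# Example input; replace with your own string of Unicode Tibetan script
+tibetan_text = 'བཀྲ་ཤིས་བདེ་ལེགས།'
+# The pipeline returns a list with one dict per input string
+transliterated_text = transliterator(tibetan_text)[0]['translation_text']
+```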
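Because the model expects Unicode Tibetan script, it can help to check input before passing it to the pipeline. The following is a minimal sketch, not part of the model or of `transformers`: `tibetan_ratio` is a hypothetical helper, and the 0.9 threshold is an arbitrary assumption. It measures how much of a string falls inside the Tibetan Unicode block (U+0F00 to U+0FFF).

```python
def tibetan_ratio(text: str) -> float:
    """Return the fraction of non-whitespace characters in the Tibetan Unicode block."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    tibetan = sum(1 for ch in chars if '\u0f00' <= ch <= '\u0fff')
    return tibetan / len(chars)


def looks_tibetan(text: str, threshold: float = 0.9) -> bool:
    """Heuristic check; the default threshold is an assumption, not a tuned value."""
    return tibetan_ratio(text) >= threshold


sample = 'བཀྲ་ཤིས་བདེ་ལེགས།'
is_valid_input = looks_tibetan(sample)
```

Strings that fail the check (for example, text already in Latin script) are probably better skipped or handled separately than sent to the model.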
+### Downstream Use
+The model can be finetuned for a specific use case using the following code.
+```python
+from datasets import load_dataset
+from transformers import AutoTokenizer, DataCollatorForSeq2Seq, AutoModelForSeq2SeqLM, Seq2SeqTrainingArguments, Seq2SeqTrainer, Adafactor
+from accelerate import Accelerator
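+# Placeholder: replace with the name or path of your own dataset,
+# which needs a 'bo' (Tibetan) column and a 'phon' (phonetic) column
+dataset = load_dataset("your-username/your-dataset")
+# Hold out 10% of the training split for evaluation
+dataset = dataset['train'].train_test_split(test_size=0.1)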
+checkpoint = "billingsmoore/tibetan-phonetic-transliteration"
+tokenizer = AutoTokenizer.from_pretrained(checkpoint)
+model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint, device_map="auto")
+# Pass the model object so the collator can prepare decoder inputs from the labels
+data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, model=model)
+source_lang = 'bo'
+target_lang = 'phon'
+def preprocess_function(examples):
+    # Tokenize the Tibetan inputs and their phonetic targets together.
+    # Padding is left to the data collator, which pads labels with -100
+    # so that padding positions are ignored by the loss.
+    inputs = examples[source_lang]
+    targets = examples[target_lang]
+    return tokenizer(inputs, text_target=targets, max_length=256, truncation=True)
+tokenized_dataset = dataset.map(preprocess_function, batched=True)
+optimizer = Adafactor(
+    model.parameters(),
+    scale_parameter=True,
+    relative_step=False,
+    warmup_init=False,
+    lr=3e-4
+)
+accelerator = Accelerator()
+model, optimizer = accelerator.prepare(model, optimizer)
+training_args = Seq2SeqTrainingArguments(
+    output_dir=".",
+    auto_find_batch_size=True,
+    predict_with_generate=True,
+    fp16=False,
+    push_to_hub=False,
+    eval_strategy='epoch',
+    save_strategy='epoch',
+    load_best_model_at_end=True,
+    num_train_epochs=5
+)
+trainer = Seq2SeqTrainer(
+    model=model,
+    args=training_args,
+    train_dataset=tokenized_dataset['train'],
+    eval_dataset=tokenized_dataset['test'],
+    tokenizer=tokenizer,
+    optimizers=(optimizer, None),
+    data_collator=data_collator
+)
+trainer.train()
+```
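The finetuning code above reads a `bo` column (Tibetan) and a `phon` column (phonetics) from the dataset. One way to prepare your own data in that shape is a JSON Lines file with one pair per line. The sketch below uses only the standard library; the helper names and the two example pairs are illustrative assumptions and are not taken from the training data.

```python
import json

# Illustrative placeholder pairs; replace with your own data
pairs = [
    {"bo": "བཀྲ་ཤིས་བདེ་ལེགས།", "phon": "tashi delek"},
    {"bo": "སངས་རྒྱས།", "phon": "sangye"},
]


def pairs_to_jsonl(pairs):
    """Serialize pairs as JSON Lines text, keeping Tibetan characters unescaped."""
    return "".join(json.dumps(pair, ensure_ascii=False) + "\n" for pair in pairs)


def jsonl_to_pairs(text):
    """Parse JSON Lines text back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


jsonl_text = pairs_to_jsonl(pairs)
```

After writing `jsonl_text` to a UTF-8 file (say, a hypothetical `pairs.jsonl`), `load_dataset("json", data_files="pairs.jsonl")` should load it as a dataset with a `train` split, which is what the `train_test_split` call in the finetuning code expects.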
+## Bias, Risks, and Limitations
+This model was trained exclusively on material from the Tibetan Buddhist canon and thus on Literary Tibetan.
+It may not perform satisfactorily on texts from other corpora or on other dialects of Tibetan.
+### Recommendations
+For users who wish to use the model for other texts, I recommend further finetuning on your own dataset using the instructions above.
+## Training Details
+This model was trained on 98,597 pairs of text. The first member of each pair is a line of Unicode Tibetan text, and the second (the target) is the phonetic transliteration of the first.
+This dataset was scraped from Lotsawa House and is released on Kaggle under the same license as the texts from which it is sourced.
+[You can find this dataset and more information by clicking here.](https://www.kaggle.com/datasets/billingsmoore/tibetan-phonetic-transliteration-pairs)
+This model was trained for five epochs. Further information regarding training can be found in the documentation of the [MLotsawa repository](https://github.com/billingsmoore/MLotsawa).
+## Model Card Contact
+billingsmoore [at] gmail [dot] com