---
language:
- om
- am
- rw
- rn
- ha
- ig
- pcm
- so
- sw
- ti
- yo
- multilingual
tags:
- T5
---
# afriteva_small
## Model description
AfriTeVa small is a sequence-to-sequence model pretrained on 10 African languages.
## Languages
Afaan Oromoo (orm), Amharic (amh), Gahuza (gah), Hausa (hau), Igbo (igb), Nigerian Pidgin (pcm), Somali (som), Swahili (swa), Tigrinya (tig), Yoruba (yor)
### More information on the model and dataset
### The model
- 64M-parameter encoder-decoder architecture (T5-like)
- 6 layers, 8 attention heads, and a 512-token sequence length
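If you want to check these figures against the published checkpoint, the snippet below is a minimal sketch that reads them from the Hugging Face config and counts parameters; it assumes the checkpoint exposes a T5-style config (fields such as `num_layers`, `num_heads`, and `d_model`).

```python
from transformers import AutoConfig, AutoModelForSeq2SeqLM

# Assumes a T5-style config; the attribute names below are standard T5Config fields.
config = AutoConfig.from_pretrained("castorini/afriteva_small")
print(config.num_layers, config.num_heads, config.d_model)

# Count parameters to compare against the ~64M figure above.
model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_small")
n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.1f}M parameters")
```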
### The dataset
- Multilingual: the 10 African languages listed above
- 143 million tokens (1 GB of text data)
- Tokenizer vocabulary size: 70,000 tokens
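As a quick sanity check, the sketch below loads the tokenizer, verifies its vocabulary size, and tokenizes a short Swahili sentence; the sentence is an illustrative example of ours, not taken from the pretraining corpus.

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_small")
print(len(tokenizer))  # expected to be about 70,000

# Tokenize a short Swahili sentence (illustrative example only).
print(tokenizer.tokenize("Habari ya asubuhi"))
```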
## Intended uses & limitations
`afriteva_small` is a pre-trained model primarily intended to be fine-tuned on multilingual sequence-to-sequence tasks.
```python
>>> from transformers import AutoModelForSeq2SeqLM, AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_small")
>>> model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_small")
>>> src_text = "Ó hùn ọ́ láti di ara wa bí?"
>>> tgt_text = "Would you like to be?"
>>> model_inputs = tokenizer(src_text, return_tensors="pt")
>>> with tokenizer.as_target_tokenizer():
...     labels = tokenizer(tgt_text, return_tensors="pt").input_ids
>>> model(**model_inputs, labels=labels)  # forward pass; returns loss and logits
```
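Since the model is meant to be fine-tuned, here is a minimal fine-tuning sketch using a plain PyTorch loop. The two translation pairs, learning rate, and number of steps are illustrative placeholders, not the authors' training setup; it also assumes the tokenizer defines a padding token, as T5-style tokenizers do.

```python
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("castorini/afriteva_small")
model = AutoModelForSeq2SeqLM.from_pretrained("castorini/afriteva_small")

# Toy parallel data (illustrative only; replace with your own dataset).
src_texts = ["Ó hùn ọ́ láti di ara wa bí?", "Habari ya asubuhi"]
tgt_texts = ["Would you like to be?", "Good morning"]

inputs = tokenizer(src_texts, padding=True, return_tensors="pt")
with tokenizer.as_target_tokenizer():
    labels = tokenizer(tgt_texts, padding=True, return_tensors="pt").input_ids
# Ignore padding tokens when computing the loss.
labels[labels == tokenizer.pad_token_id] = -100

optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # placeholder learning rate
model.train()
for step in range(10):  # placeholder number of steps
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()

# Generate from the fine-tuned model.
model.eval()
generated = model.generate(**tokenizer(src_texts[0], return_tensors="pt"), max_length=40)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```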
## Training Procedure
For information on training procedures, please refer to the AfriTeVa [paper](#) or [repository](https://github.com/castorini/afriteva).
## BibTeX entry and citation info
Coming soon ...