---
language:
- gos
- nld
- nl
datasets:
- Tom9358/tatoeba_21-dec-2024
base_model:
- facebook/nllb-200-distilled-1.3B
pipeline_tag: translation
tags:
- language
- linguistics
- low-resource
- translation
- tatoeba
- nllb
- machine-translation
- gronings
---

Moi! I used sentence pairs from https://tatoeba.org/ to fine-tune an NLLB model for Gronings. Consider this an early beta version!

I am a linguist and a speaker of Gronings, so I evaluated the output by expert eyeball. I have not yet thoroughly investigated performance with BLEU scores or similar metrics for this version. The model produces output that is recognizable as Gronings when the input language is Dutch. I found that interesting enough for a proof of concept, so I decided to publish. The model's hyperparameters are not yet optimal, so I plan to upload an improved version in the future.

Update 10 September 2025: I've updated the code to the latest version of `transformers`, so the model can now be used by anyone immediately, without any tokenizer black magic. About 500 additional parallel nld-gos sentence pairs were also added to the training data. Only the extra Gronings language token needs to be added to the tokenizer at initialization; after that, everything should work.
Here is a minimal example code snippet to get the model up and running:

```py
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

MODEL_URL = 'Tom9358/nllb-tatoeba-gos-nld-v1'

model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
# The Gronings language token is not in the stock NLLB vocabulary,
# so it must be registered as an additional special token.
tokenizer = NllbTokenizer.from_pretrained(
    MODEL_URL,
    force_download=True,
    additional_special_tokens=["gos_Latn"]
)

def translate(text, src_lang: str = "nld_Latn", tgt_lang: str = "gos_Latn", **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(
        text, return_tensors='pt', padding='longest',
        truncation=True, max_length=120
    )
    result = model.generate(
        **inputs.to(model.device),
        # Force the decoder to start in the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        max_new_tokens=int(16 + 1.5 * inputs.input_ids.shape[1]),
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

translate("Dit is een testzin om te kijken of de code werkt.")
```
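The `max_new_tokens` argument in the snippet above caps generation length with a simple heuristic that scales with the input length. A minimal sketch of that formula in isolation, so you can see what the cap works out to (the helper name `max_output_tokens` is mine, not part of the model card):

```python
def max_output_tokens(input_token_count: int) -> int:
    """Cap on generated tokens: a fixed budget of 16 plus 1.5x the input length.

    Mirrors the max_new_tokens expression used in translate() above; the fixed
    budget keeps very short inputs from being cut off, while the 1.5x factor
    leaves room for the target language needing more tokens than the source.
    """
    return int(16 + 1.5 * input_token_count)

# For a 120-token input (the tokenizer's max_length above), the cap is 196 tokens.
print(max_output_tokens(120))  # → 196
```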
See https://github.com/tom9358/nllb-tryout for everything (code, more documentation, and references) except the model itself and the training data.

A free and accessible (if rather slow) way to try out the model: https://colab.research.google.com/drive/1b5dn3VT4fvOBKly1CIx4Qwo59GDM1H-M. The code there is also a minimal example of how to use this model.

Don't hesitate to contact me if anything comes up!