Update README.md with details about the latest model update

README.md (CHANGED)
- gronings
---

Moi!

I used sentence pairs from https://tatoeba.org/ to finetune an NLLB model for Gronings. Consider this an early beta version!

I am a linguist and a speaker of Gronings, so I carried out evaluation by expert eyeball. I haven't thoroughly investigated the performance by means of BLEU scores or other metrics for this version.
The model produces output that is recognizable as Gronings when the input language is Dutch. I found that interesting enough for a proof of concept, so I decided to publish.

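For readers curious what a metric like BLEU actually measures, here is a minimal illustrative sketch of its simplest ingredient, clipped unigram precision. This is only an illustration (the sample tokens are made up), not part of the evaluation of this model:

```py
from collections import Counter

def unigram_precision(hypothesis: str, reference: str) -> float:
    # Clipped unigram precision, the simplest ingredient of BLEU:
    # what fraction of hypothesis tokens also appears in the reference,
    # counting each reference token at most as often as it occurs there?
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(count, ref_counts[tok])
                  for tok, count in Counter(hyp_tokens).items())
    return matched / len(hyp_tokens)

print(unigram_precision("dit is n test", "dit is n test"))  # 1.0
print(unigram_precision("dit is n test", "dit was n test")) # 0.75
```

Full BLEU additionally combines higher-order n-gram precisions with a brevity penalty, and corpus-level tooling such as sacrebleu is the usual way to compute it.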
The model is not optimal in terms of hyperparameters, so I am planning to upload an even better version in the future.
Update 10 September 2025: I've updated the code to the latest version of `transformers`, so the model can immediately be used by anyone, without any tokenizer black magic. About 500 more parallel nld-gos sentence pairs were also added to the training data.
Only the additional Gronings language token (`gos_Latn`) needs to be added to the tokenizer at initialization; then everything should work.
<details>
<summary>Minimal example code snippet to get the model up and running (click)</summary>

```py
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

MODEL_URL = 'Tom9358/nllb-tatoeba-gos-nld-v1'
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
# Register the extra Gronings language token that the stock NLLB tokenizer lacks.
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True,
                                          additional_special_tokens=["gos_Latn"])

def translate(text, src_lang: str = "nld_Latn", tgt_lang: str = "gos_Latn", **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(
        text,
        return_tensors='pt',
        padding='longest',
        truncation=True,
        max_length=120,
    )
    result = model.generate(
        **inputs.to(model.device),
        # Force the decoder to start generating in the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Generation budget: 16 tokens of headroom plus 1.5x the input length.
        max_new_tokens=int(16 + 1.5 * inputs.input_ids.shape[1]),
        **kwargs,
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

translate("Dit is een testzin om te kijken of de code werkt.")
```
</details>
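The `max_new_tokens` budget in the snippet grows with the input length. The arithmetic can be checked in plain Python, without loading the model; `generation_budget` is just an illustrative name for the same formula:

```py
def generation_budget(input_len: int) -> int:
    # Mirrors the snippet's heuristic: 16 tokens of headroom
    # plus 1.5 output tokens per input token.
    return int(16 + 1.5 * input_len)

print(generation_budget(10))   # 31
print(generation_budget(120))  # 196, the largest possible budget given max_length=120
```

Gronings and Dutch are closely related, so output length rarely needs more than this multiple of the input length.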

See https://github.com/tom9358/nllb-tryout for everything (code, more documentation and references) except the model itself and the training data.

A (rather slow, but at least free and accessible to everyone) way to try out the model:
https://colab.research.google.com/drive/1b5dn3VT4fvOBKly1CIx4Qwo59GDM1H-M

The code there is also a minimal example of how to use this model.

Don't hesitate to contact me if anything comes up!