Update README.md with details about the latest model update

README.md (CHANGED)
- gronings
---

Moi!

I used sentence pairs from https://tatoeba.org/ to finetune an NLLB model for Gronings. Consider this an early beta version!

I am a linguist and a speaker of Gronings, so I carried out evaluation by expert eyeball. I haven't thoroughly investigated the performance by means of BLEU scores or other metrics for this version.
The model produces output that is recognizable as Gronings when the input language is Dutch. I found that interesting enough for a proof of concept, so I decided to publish.

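For readers curious what a metric like BLEU actually measures, here is a minimal illustrative sketch of its simplest ingredient, clipped unigram precision. This is only an illustration (the sample tokens are made up), not part of the evaluation of this model:

```py
from collections import Counter

def unigram_precision(hypothesis: str, reference: str) -> float:
    # Clipped unigram precision, the simplest ingredient of BLEU:
    # what fraction of hypothesis tokens also appears in the reference,
    # counting each reference token at most as often as it occurs there?
    hyp_tokens = hypothesis.lower().split()
    if not hyp_tokens:
        return 0.0
    ref_counts = Counter(reference.lower().split())
    matched = sum(min(count, ref_counts[tok])
                  for tok, count in Counter(hyp_tokens).items())
    return matched / len(hyp_tokens)

print(unigram_precision("dit is n test", "dit is n test"))  # 1.0
print(unigram_precision("dit is n test", "dit was n test")) # 0.75
```

Full BLEU additionally combines higher-order n-gram precisions with a brevity penalty, and corpus-level tooling such as sacrebleu is the usual way to compute it.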
The model is not optimal in terms of hyperparameters, so I am planning to upload an even better version in the future.
Update 10 September 2025: I've updated the code to the latest version of `transformers`, so the model can immediately be used by anyone, without any tokenizer black magic. About 500 more parallel nld-gos sentence pairs were also added to the training data.
Only the additional Gronings language token (`gos_Latn`) needs to be added to the tokenizer at initialization; then everything should work.
<details>
<summary>Minimal example code snippet to get the model up and running (click)</summary>

```py
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

MODEL_URL = 'Tom9358/nllb-tatoeba-gos-nld-v1'
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
# Register the extra Gronings language token that the stock NLLB tokenizer lacks.
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True,
                                          additional_special_tokens=["gos_Latn"])

def translate(text, src_lang: str = "nld_Latn", tgt_lang: str = "gos_Latn", **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    inputs = tokenizer(
        text,
        return_tensors='pt',
        padding='longest',
        truncation=True,
        max_length=120,
    )
    result = model.generate(
        **inputs.to(model.device),
        # Force the decoder to start generating in the target language.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Generation budget: 16 tokens of headroom plus 1.5x the input length.
        max_new_tokens=int(16 + 1.5 * inputs.input_ids.shape[1]),
        **kwargs,
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

translate("Dit is een testzin om te kijken of de code werkt.")
```
</details>
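The `max_new_tokens` budget in the snippet grows with the input length. The arithmetic can be checked in plain Python, without loading the model; `generation_budget` is just an illustrative name for the same formula:

```py
def generation_budget(input_len: int) -> int:
    # Mirrors the snippet's heuristic: 16 tokens of headroom
    # plus 1.5 output tokens per input token.
    return int(16 + 1.5 * input_len)

print(generation_budget(10))   # 31
print(generation_budget(120))  # 196, the largest possible budget given max_length=120
```

Gronings and Dutch are closely related, so output length rarely needs more than this multiple of the input length.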

See https://github.com/tom9358/nllb-tryout for everything (code, more documentation and references) except the model itself and the training data.

A (rather slow, but at least free and accessible to everyone) way to try out the model:
https://colab.research.google.com/drive/1b5dn3VT4fvOBKly1CIx4Qwo59GDM1H-M

The code there is also a minimal example of how to use this model.

Don't hesitate to contact me if anything comes up!