---
language:
- gos
- nld
- nl
datasets:
- Tom9358/tatoeba_21-dec-2024
base_model:
- facebook/nllb-200-distilled-1.3B
pipeline_tag: translation
tags:
- language
- linguistics
- low-resource
- translation
- tatoeba
- nllb
- machine-translation
- gronings
---

Moi!

I used sentence pairs from https://tatoeba.org/ to fine-tune an NLLB model for Gronings. Consider this an early beta version!

I am a linguist and a speaker of Gronings, so I carried out evaluation by expert eyeball; I haven't thoroughly investigated performance with BLEU scores or similar metrics for this version.
The model produces something that is recognizable as Gronings when the input language is Dutch. I found that interesting enough for a PoC, so I decided to publish.

The model is not optimal in terms of hyperparameters, so I am planning to upload an even better version in the future.

**Update 10 September 2025:** I've updated the code to the latest version of `transformers`, so the model can be used immediately by anyone, without any tokenizer black magic. About 500 more parallel nld-gos sentence pairs were also added to the training data.
Only the additional Gronings language token (`gos_Latn`) needs to be added to the tokenizer at initialization; then everything should work.
<details>
  <summary>Minimal example code snippet to get the model up and running (click to expand)</summary>

```py
from transformers import AutoModelForSeq2SeqLM, NllbTokenizer

MODEL_URL = 'Tom9358/nllb-tatoeba-gos-nld-v1'
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
# The Gronings token gos_Latn is not in the stock NLLB tokenizer,
# so register it as an additional special token at initialization:
tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True,
                                          additional_special_tokens=["gos_Latn"])

def translate(text, src_lang: str = "nld_Latn", tgt_lang: str = "gos_Latn", **kwargs):
    tokenizer.src_lang = src_lang
    tokenizer.tgt_lang = tgt_lang
    # padding='longest' means `text` may be a single string or a list of strings:
    inputs = tokenizer(
        text,
        return_tensors='pt',
        padding='longest',
        truncation=True,
        max_length=120
    )
    result = model.generate(
        **inputs.to(model.device),
        # Force the decoder to start with the target-language token:
        forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
        # Length budget: a fixed floor of 16 tokens plus 1.5x the source length:
        max_new_tokens=int(16 + 1.5 * inputs.input_ids.shape[1]),
        **kwargs
    )
    return tokenizer.batch_decode(result, skip_special_tokens=True)

translate("Dit is een testzin om te kijken of de code werkt.")
```
</details>
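A note on the `max_new_tokens` expression in the snippet: it grants the decoder a fixed floor of 16 tokens plus 1.5 tokens per source token, which you can sanity-check without downloading the model. A minimal standalone sketch (`generation_budget` is a hypothetical helper name, not part of the model's code):

```python
def generation_budget(input_len: int) -> int:
    # Mirrors max_new_tokens=int(16 + 1.5 * inputs.input_ids.shape[1])
    # from the snippet above: a floor of 16 plus 1.5x the source length.
    return int(16 + 1.5 * input_len)

# e.g. a 40-token Dutch input gets a budget of up to 76 new tokens:
print(generation_budget(40))
```

Since the helper pads with `padding='longest'`, you can also pass a list of sentences in one call, and swapping `src_lang` and `tgt_lang` may work for the reverse gos-to-nld direction as well, though I haven't evaluated that direction.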

See https://github.com/tom9358/nllb-tryout for everything (code, further documentation, and references) except the model itself and the training data.

A (rather slow, but at least free and accessible to everyone) way to try out the model:
https://colab.research.google.com/drive/1b5dn3VT4fvOBKly1CIx4Qwo59GDM1H-M

The code there is also a minimal example of how to use this model.

Don't hesitate to contact me if anything comes up!