Tom9358 committed
Commit ec45c32 · verified · 1 Parent(s): 4dd68f6

Update README.md with details about the latest model update

Files changed (1):
  1. README.md +49 -7

README.md CHANGED
@@ -20,13 +20,55 @@ tags:
  - gronings
  ---
 
- Consider this an early beta version. I used sentence pairs from https://tatoeba.org/ to finetune an NLLB model for Gronings.
 
- I am a linguist and speaker of Gronings so I carried out evaluation by expert's eyeball.
- I haven't thoroughly investigated the performance by means of BLEU scores or anything for this version.
- Nonetheless, I found the performance to be not terrible and thus decided to publish.
- The model is very likely not optimal in terms of hyperparameters, so I am planning to upload an even better version in the future.
- See https://github.com/tom9358/nllb-tryout for everything (code, more documentation and references) except the model itself
 
  A (rather slow, but at least free and accessible to everyone) way to try out the model:
- https://colab.research.google.com/drive/1b5dn3VT4fvOBKly1CIx4Qwo59GDM1H-M
  - gronings
  ---
 
+ Moi!
 
+ I used sentence pairs from https://tatoeba.org/ to finetune an NLLB model for Gronings. Consider this an early beta version!
+
+ I am a linguist and a speaker of Gronings, so I carried out evaluation by expert eyeball. I haven't thoroughly investigated the performance with BLEU scores or similar metrics for this version.
+ The model produces output that is recognizable as Gronings when the input language is Dutch. I found that interesting enough for a proof of concept, so I decided to publish.
+
+ The model is not optimal in terms of hyperparameters, so I am planning to upload an even better version in the future.
+ Update 10 September 2025: I've updated the code to the latest version of `transformers`, so the model can immediately be used by anyone, with no tokenizer black magic needed. About 500 more parallel nld-gos sentence pairs were also added to the training data.
+ Only the additional Gronings language token needs to be added to the tokenizer at initialization; then everything should work.
+ <details>
+ <summary>A minimal example snippet to get the model up and running (click to expand)</summary>
+
+ ```py
+ from transformers import AutoModelForSeq2SeqLM
+ from transformers import NllbTokenizer
+
+ MODEL_URL = 'Tom9358/nllb-tatoeba-gos-nld-v1'
+ model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_URL)
+ tokenizer = NllbTokenizer.from_pretrained(MODEL_URL, force_download=True,
+                                           additional_special_tokens=["gos_Latn"])
+
+ def translate(text, src_lang: str = "nld_Latn", tgt_lang: str = "gos_Latn", **kwargs):
+     tokenizer.src_lang = src_lang
+     tokenizer.tgt_lang = tgt_lang
+     inputs = tokenizer(
+         text,
+         return_tensors='pt',
+         padding='longest',
+         truncation=True,
+         max_length=120
+     )
+     result = model.generate(
+         **inputs.to(model.device),
+         forced_bos_token_id=tokenizer.convert_tokens_to_ids(tgt_lang),
+         max_new_tokens=int(16 + 1.5 * inputs.input_ids.shape[1]),
+         **kwargs
+     )
+     return tokenizer.batch_decode(result, skip_special_tokens=True)
+
+ translate("Dit is een testzin om te kijken of de code werkt.")
+ ```
+ </details>
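As a side note, the `max_new_tokens` bound in the snippet scales with input length: a fixed headroom of 16 tokens plus 1.5× the tokenized input length, so short inputs still get room to generate while long inputs cannot run away. A minimal standalone sketch of that heuristic (the function name is mine; no model download needed):

```python
def max_new_tokens_for(input_len: int) -> int:
    """Generation cap from the snippet above: 16 tokens of headroom
    plus 1.5x the input length, truncated to an int."""
    return int(16 + 1.5 * input_len)

print(max_new_tokens_for(10))   # short input  -> 31
print(max_new_tokens_for(120))  # max_length-truncated input -> 196
```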
+
+ See https://github.com/tom9358/nllb-tryout for everything (code, more documentation, and references) except the model itself and the training data.
 
  A (rather slow, but at least free and accessible to everyone) way to try out the model:
+ https://colab.research.google.com/drive/1b5dn3VT4fvOBKly1CIx4Qwo59GDM1H-M
+
+ The code there is also a minimal example of how to use this model.
+
+ Don't hesitate to contact me if anything comes up!