tahrirchi/dilmash-til

@@ -22,11 +22,16 @@ This repository contains a collection of machine translation models for the Kara
 We provide three variants of our Karakalpak translation model:
-| Model | Base Model | Parameters | Tokenizer Length | Datasets | Languages |
-|-------|------------|------------|-------------------|----------|-----------|
-| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) | [nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 615M | 256,204 | [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash) | Karakalpak, Uzbek, Russian, English |
-| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) | [nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M) | 629M | 269,399 | [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash) | Karakalpak, Uzbek, Russian, English |
-| **[`dilmash-TIL`](https://huggingface.co/tahrirchi/dilmash-TIL)** | **[nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)** | **629M** | **269,399** | **[Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash), TIL corpus** | **Karakalpak, Uzbek, Russian, English** |
 ## Intended uses & limitations
@@ -67,18 +72,21 @@ The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmas
 ## Training procedure
-For full details of the training procedure, please refer to our paper (coming soon!).
 ## Citation
 If you use these models in your research, please cite our paper:
 ```bibtex
-@inproceedings{mamasaidov2024advancing,
-  title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
-  author={Mamasaidov, Mukhammadsaid and Shopulatov, Abror},
-  booktitle={Proceedings of the OLDI Workshop},
-  year={2024}
 }
 ```
@@ -92,6 +100,9 @@ We are thankful to these awesome organizations and people for helping to make it
  - [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
  - Ajiniyaz Nurniyazov: for advice throughout the process
 ## Contacts
 We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, Karakalpak in particular.

 We provide three variants of our Karakalpak translation model:
+| Model | Tokenizer Length | Parameter Count | Unique Features |
+|-------|------------|-------------------|-----------------|
+| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) | 256,204 | 615M | Original NLLB tokenizer |
+| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) | 269,399 | 629M | Expanded tokenizer |
+| [**`dilmash-TIL`**](https://huggingface.co/tahrirchi/dilmash-TIL) | **269,399** | **629M** | **Additional TIL corpus** |
+**Common attributes:**
+- **Base Model:** [nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
+- **Primary Dataset:** [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash)
+- **Languages:** Karakalpak, Uzbek, Russian, English
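
The gap between the 615M and 629M parameter counts in the table can be sanity-checked from the tokenizer lengths alone. The sketch below is a rough estimate, not the authors' accounting: it assumes the base model uses a hidden size of 1024 and a single embedding matrix shared across encoder, decoder, and output layer (neither is stated in this README), so each new token adds one row of 1024 weights. The helper name `added_embedding_params` is hypothetical.

```python
# Values taken from the model table above.
RAW_TOKENIZER_LENGTH = 256_204       # dilmash-raw (original NLLB tokenizer)
EXPANDED_TOKENIZER_LENGTH = 269_399  # dilmash and dilmash-TIL
RAW_PARAMS_MILLIONS = 615            # dilmash-raw parameter count, in millions

# Assumption (not stated in this README): embedding width of the base model,
# with one embedding matrix shared by encoder, decoder, and output layer.
ASSUMED_HIDDEN_SIZE = 1024


def added_embedding_params(old_vocab: int, new_vocab: int, hidden_size: int) -> int:
    """Parameters added by growing a shared embedding matrix (hypothetical helper)."""
    return (new_vocab - old_vocab) * hidden_size


added_tokens = EXPANDED_TOKENIZER_LENGTH - RAW_TOKENIZER_LENGTH
added_params = added_embedding_params(
    RAW_TOKENIZER_LENGTH, EXPANDED_TOKENIZER_LENGTH, ASSUMED_HIDDEN_SIZE
)
estimated_total_millions = RAW_PARAMS_MILLIONS + added_params / 1e6

print(f"New tokens: {added_tokens}")
print(f"Added embedding parameters: {added_params}")
print(f"Estimated total: about {round(estimated_total_millions)}M parameters")
```

Under these assumptions, the 13,195 new tokens account for roughly 13.5M extra parameters, which is consistent with the 629M figure listed for the expanded-tokenizer variants.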
 ## Intended uses & limitations
 ## Training procedure
+For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2409.04269).
 ## Citation
 If you use these models in your research, please cite our paper:
 ```bibtex
+@misc{mamasaidov2024openlanguagedatainitiative,
+      title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
+      author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
+      year={2024},
+      eprint={2409.04269},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2409.04269},
 }
 ```
  - [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
  - Ajiniyaz Nurniyazov: for advice throughout the process
+We would also like to express our sincere appreciation to [Google for Startups](https://cloud.google.com/startup) for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.
 ## Contacts
 We believe that this work will enable and inspire enthusiasts around the world to uncover the hidden beauty of low-resource languages, Karakalpak in particular.