changing citation and some minor changes
README.md
CHANGED
@@ -22,11 +22,16 @@ This repository contains a collection of machine translation models for the Kara

 We provide three variants of our Karakalpak translation model:

-| Model |
-
-| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) |
-| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) |
-

 ## Intended uses & limitations

@@ -67,18 +72,21 @@ The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmas

 ## Training procedure

-For full details of the training procedure, please refer to our paper

 ## Citation

 If you use these models in your research, please cite our paper:

 ```bibtex
-@
-
-
-
-
 }
 ```

@@ -92,6 +100,9 @@ We are thankful to these awesome organizations and people for helping to make it
 - [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
 - Ajiniyaz Nurniyazov: for advice throughout the process

 ## Contacts

 We believe that this work will enable and inspire all enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Karakalpak.

@@ -22,11 +22,16 @@ This repository contains a collection of machine translation models for the Kara

 We provide three variants of our Karakalpak translation model:

+| Model | Tokenizer Length | Parameter Count | Unique Features |
+|-------|------------|-------------------|-----------------|
+| [`dilmash-raw`](https://huggingface.co/tahrirchi/dilmash-raw) | 256,204 | 615M | Original NLLB tokenizer |
+| [`dilmash`](https://huggingface.co/tahrirchi/dilmash) | 269,399 | 629M | Expanded tokenizer |
+| [**`dilmash-TIL`**](https://huggingface.co/tahrirchi/dilmash-TIL) | **269,399** | **629M** | **Additional TIL corpus** |
+
+**Common attributes:**
+- **Base Model:** [nllb-200-600M](https://huggingface.co/facebook/nllb-200-distilled-600M)
+- **Primary Dataset:** [Dilmash corpus](https://huggingface.co/datasets/tahrirchi/dilmash)
+- **Languages:** Karakalpak, Uzbek, Russian, English
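
The tokenizer lengths and parameter counts in the added table can be cross-checked with a little arithmetic. The sketch below is not part of the commit. It assumes that the base NLLB model uses an embedding width of 1024 and a single embedding matrix shared by the encoder, decoder, and output layer, so that each new token adds exactly one embedding row. Neither assumption is stated in the README; the names `HIDDEN_SIZE` and `extra_embedding_params` are hypothetical.

```python
# Figures copied from the model table in this diff.
TOKENIZER_LENGTH = {"dilmash-raw": 256_204, "dilmash": 269_399}
REPORTED_PARAMS_M = {"dilmash-raw": 615, "dilmash": 629}  # millions, rounded

# ASSUMPTION (not stated in the README): embedding width of the base model,
# with one tied embedding matrix, so each new token costs one row of weights.
HIDDEN_SIZE = 1024


def extra_embedding_params(old_vocab: int, new_vocab: int,
                           hidden_size: int = HIDDEN_SIZE) -> int:
    """Parameters added by growing a tied embedding matrix."""
    return (new_vocab - old_vocab) * hidden_size


added_tokens = TOKENIZER_LENGTH["dilmash"] - TOKENIZER_LENGTH["dilmash-raw"]
extra_params = extra_embedding_params(TOKENIZER_LENGTH["dilmash-raw"],
                                      TOKENIZER_LENGTH["dilmash"])
reported_gap_m = REPORTED_PARAMS_M["dilmash"] - REPORTED_PARAMS_M["dilmash-raw"]

print(f"tokens added by the expanded tokenizer: {added_tokens:,}")
print(f"estimated extra parameters: {extra_params:,} (~{extra_params / 1e6:.1f}M)")
print(f"gap between the reported counts: {reported_gap_m}M")
```

Under these assumptions, the estimate of roughly 13.5M extra parameters agrees with the 14M gap between the reported 615M and 629M once rounding of the table values is taken into account.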

 ## Intended uses & limitations

@@ -67,18 +72,21 @@ The dataset is available [here](https://huggingface.co/datasets/tahrirchi/dilmas

 ## Training procedure

+For full details of the training procedure, please refer to [our paper](https://arxiv.org/abs/2409.04269).

 ## Citation

 If you use these models in your research, please cite our paper:

 ```bibtex
+@misc{mamasaidov2024openlanguagedatainitiative,
+      title={Open Language Data Initiative: Advancing Low-Resource Machine Translation for Karakalpak},
+      author={Mukhammadsaid Mamasaidov and Abror Shopulatov},
+      year={2024},
+      eprint={2409.04269},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2409.04269},
 }
 ```

@@ -92,6 +100,9 @@ We are thankful to these awesome organizations and people for helping to make it
 - [Atabek Murtazaev](https://www.linkedin.com/in/atabek/): for advice throughout the process
 - Ajiniyaz Nurniyazov: for advice throughout the process

+We would also like to express our sincere appreciation to [Google for Startups](https://cloud.google.com/startup) for generously sponsoring the compute resources necessary for our experiments. Their support has been instrumental in advancing our research in low-resource language machine translation.
+
+
 ## Contacts

 We believe that this work will enable and inspire all enthusiasts around the world to uncover the hidden beauty of low-resource languages, in particular Karakalpak.