SpeD_ParakeetRo_110M_TDT-CTC is a Romanian Automatic Speech Recognition (ASR) model based on the FastConformer Hybrid TDT-CTC 110M architecture from NVIDIA NeMo.
The model is adapted from an English pre-trained checkpoint to Romanian through transfer learning, leveraging both speech and text data to achieve strong performance on a low-resource language.
Model Architecture
- Base model: Parakeet Hybrid TDT-CTC 110M (NVIDIA NeMo)
- Pre-training: Self-Supervised Learning (SSL) on LibriLight + supervised fine-tuning on 36k hours of English
- Tokenizer: SentencePiece, 1024 BPE tokens (max subword length 5)
- Romanian alphabet: 31 characters + hyphen ("-")
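For a quick check, the released checkpoint can be loaded directly through the NeMo Python API; a minimal sketch is below (the checkpoint and audio file names are placeholders, and the exact return type of `transcribe()` depends on the NeMo version):

```python
# Minimal inference sketch: restore the hybrid TDT-CTC checkpoint and
# transcribe a 16 kHz mono WAV file (file names below are placeholders).
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from("SpeD_ParakeetRo_110M_TDT-CTC.nemo")

transcriptions = asr_model.transcribe(["sample_ro.wav"])
print(transcriptions[0])
```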
Romanian Adaptation
- Training speech: 2,636 hours of Romanian speech (manually and automatically annotated)
- Text corpus: 24.6M cleaned Romanian sentences (news domain + speech transcriptions)
- Tokenizer: Rebuilt on Romanian text corpus using SentencePiece
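A sketch of how such a 1024-token Romanian BPE tokenizer can be rebuilt with the SentencePiece Python API (the corpus path and the exact training flags are assumptions; NeMo also provides a `process_asr_text_tokenizer.py` helper for this step):

```python
# Sketch: train a 1024-token BPE tokenizer on the cleaned Romanian text corpus.
# The corpus path is a placeholder; flags mirror the values stated in this card.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="romanian_text_corpus.txt",     # one cleaned sentence per line
    model_prefix="tokenizer_ro_bpe_1024",
    vocab_size=1024,                      # 1024 BPE tokens
    model_type="bpe",
    max_sentencepiece_length=5,           # max subword length 5
    character_coverage=1.0,               # keep Romanian diacritics
)
```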
Data Augmentation
- Noise: MUSAN (6 h Freesound subset), SNR 10–30, prob. 0.2
- Speed perturbation: 0.9–1.1, prob. 0.4
- SpecAugment + SpecCutout
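A sketch of the corresponding `train_ds.augmentor` block, written here as a Python/OmegaConf fragment rather than YAML; the key names follow NeMo's audio-augmentation convention and the MUSAN manifest path is a placeholder, so verify them against the actual training config (SpecAugment and SpecCutout are configured separately under `model.spec_augment`):

```python
# Sketch of the waveform augmentation settings listed above, expressed as a
# Python dict in NeMo's augmentor format (key names are assumptions to be
# checked against the NeMo version in use).
from omegaconf import OmegaConf

augmentor = OmegaConf.create({
    "noise": {
        "manifest_path": "musan_freesound_manifest.json",  # placeholder path
        "prob": 0.2,
        "min_snr_db": 10,
        "max_snr_db": 30,
    },
    "speed": {
        "prob": 0.4,
        "sr": 16000,
        "resample_type": "kaiser_best",
        "min_speed_rate": 0.9,
        "max_speed_rate": 1.1,
    },
})
```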
Training Details
This script was used for training, together with the base training configuration.
- Optimizer: AdamW (lr=2.0, weight_decay=1e-3)
- Scheduler: Noam Annealing with 10k warmup steps
- CTC loss weight: 0.3
- Epochs: 30
- Batch size: 32 (grad accumulation 8)
- Precision: BFloat16
- Hardware: NVIDIA RTX 4090 24GB
- Epoch time: ~5.5 hours
- Final model: average of the 10 checkpoints with the best validation WER
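NeMo ships a checkpoint-averaging utility for this last step; a framework-agnostic sketch of the same idea in plain PyTorch is shown below (checkpoint paths are placeholders):

```python
# Sketch: average the weights of the top-10 checkpoints (selected by
# validation WER) into a single state dict. Paths are placeholders.
import torch

ckpt_paths = [f"checkpoints/top_{i}.ckpt" for i in range(10)]

avg_state = None
for path in ckpt_paths:
    state = torch.load(path, map_location="cpu")["state_dict"]
    if avg_state is None:
        avg_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg_state[k] += v.float()

avg_state = {k: v / len(ckpt_paths) for k, v in avg_state.items()}
torch.save({"state_dict": avg_state}, "checkpoints/averaged.ckpt")
```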
Inference
To run your own evaluation, navigate to the `examples/asr` directory of the NVIDIA NeMo repository:
```bash
cd examples/asr
python3 speech_to_text_eval.py \
    dataset_manifest=../../manifests/SSC-eval1_manifest.json \
    model_path=... \
    output_filename=... \
    decoder_type=ctc \
    ctc_decoding.strategy=beam \
    ctc_decoding.beam.kenlm_path=... \
    ctc_decoding.beam.beam_alpha=... \
    ctc_decoding.beam.beam_beta=... \
    ctc_decoding.beam.beam_size=...
```
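`dataset_manifest` points to a NeMo-style JSON-lines manifest; a short sketch of building one (audio paths, durations, and transcripts below are placeholders):

```python
# Sketch: write a NeMo-style JSON-lines evaluation manifest. Each line holds
# the audio path, its duration in seconds, and the reference text.
import json
import soundfile as sf

samples = [("audio/utt_0001.wav", "transcrierea de referință")]  # placeholders

with open("SSC-eval1_manifest.json", "w", encoding="utf-8") as f:
    for audio_path, text in samples:
        duration = sf.info(audio_path).duration
        entry = {"audio_filepath": audio_path, "duration": duration, "text": text}
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```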
Results
The model can be used with two decoders (TDT and CTC) and several decoding strategies. To use it with the external N-gram language model, check this model card. The beam parameters used with the 6-gram token model are: beam_size=32, beam_alpha=0.9, beam_beta=2.
| Architecture | Decoding | RSC-eval | SSC-eval1 | SSC-eval2 | CDEP-eval | CV-21 | Fleurs-RO | USPDATRO | RTFx |
|---|---|---|---|---|---|---|---|---|---|
| Parakeet Ro 110M TDT (ours) | Greedy | 2.16 | 9.08 | 10.85 | 4.20 | 3.57 | 10.61 | 24.08 | 126.15 |
| Parakeet Ro 110M TDT (ours) | ALSD | 2.05 | 8.64 | 10.88 | 4.17 | 3.38 | 10.16 | 24.30 | 66.63 |
| Parakeet Ro 110M CTC (ours) | Greedy | 2.57 | 10.10 | 12.65 | 4.80 | 4.20 | 11.85 | 27.80 | 130.55 |
| Parakeet Ro 110M CTC (ours) | Beam Token N-gram | 1.73 | 8.12 | 10.75 | 3.92 | 3.29 | 8.85 | 23.40 | 109.46 |
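Word error rates can be recomputed from the manifest written by `speech_to_text_eval.py`; a sketch using `jiwer` (the output path is a placeholder, and the `pred_text` field name follows NeMo's usual manifest convention):

```python
# Sketch: recompute WER from the prediction manifest produced by the
# evaluation script (field names follow NeMo's manifest convention).
import json
import jiwer

refs, hyps = [], []
with open("predictions_manifest.json", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        refs.append(entry["text"])
        hyps.append(entry["pred_text"])

print(f"WER: {100 * jiwer.wer(refs, hyps):.2f}%")
```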
Citation
If you use this model, please cite:
....
Also consider citing the original NVIDIA NeMo framework and KenLM:
```bibtex
@article{kuchaiev2019nemo,
  title   = {NeMo: a toolkit for building AI applications using Neural Modules},
  author  = {Kuchaiev, Oleksii and Ginsburg, Boris and others},
  journal = {arXiv preprint arXiv:1909.09577},
  year    = {2019}
}

@inproceedings{heafield-2011-kenlm,
  title     = {{K}en{LM}: Faster and Smaller Language Model Queries},
  author    = {Heafield, Kenneth},
  editor    = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Zaidan, Omar F.},
  booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
  month     = jul,
  year      = {2011},
  address   = {Edinburgh, Scotland},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/W11-2123/},
  pages     = {187--197}
}
```
Contact
For questions or collaborations: [email protected]
License: Apache-2.0