SpeD_ParakeetRo_110M_TDT-CTC is a Romanian Automatic Speech Recognition (ASR) model based on the FastConformer Hybrid TDT-CTC 110M architecture from NVIDIA NeMo.
The model is adapted from an English pre-trained checkpoint to Romanian through transfer learning, leveraging both speech and text data to achieve strong performance on a low-resource language.
Model Architecture
- Base model: Parakeet Hybrid TDT-CTC 110M (NVIDIA NeMo)
- Pre-training: Self-Supervised Learning (SSL) on LibriLight + supervised fine-tuning on 36k hours of English
- Tokenizer: SentencePiece, 1024 BPE tokens (max subword length 5)
- Romanian alphabet: 31 characters + hyphen ("-")
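For a quick check, the released checkpoint can be loaded directly through the NeMo Python API; a minimal sketch is below (the checkpoint and audio file names are placeholders, and the exact return type of `transcribe()` depends on the NeMo version):

```python
# Minimal inference sketch: restore the hybrid TDT-CTC checkpoint and
# transcribe a 16 kHz mono WAV file (file names below are placeholders).
import nemo.collections.asr as nemo_asr

asr_model = nemo_asr.models.ASRModel.restore_from("SpeD_ParakeetRo_110M_TDT-CTC.nemo")

transcriptions = asr_model.transcribe(["sample_ro.wav"])
print(transcriptions[0])
```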
Romanian Adaptation
- Training speech: 2,636 hours of Romanian speech (manually and automatically annotated)
- Text corpus: 24.6M cleaned Romanian sentences (news domain + speech transcriptions)
- Tokenizer: Rebuilt on Romanian text corpus using SentencePiece
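A sketch of how such a 1024-token Romanian BPE tokenizer can be rebuilt with the SentencePiece Python API (the corpus path and the exact training flags are assumptions; NeMo also provides a `process_asr_text_tokenizer.py` helper for this step):

```python
# Sketch: train a 1024-token BPE tokenizer on the cleaned Romanian text corpus.
# The corpus path is a placeholder; flags mirror the values stated in this card.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="romanian_text_corpus.txt",     # one cleaned sentence per line
    model_prefix="tokenizer_ro_bpe_1024",
    vocab_size=1024,                      # 1024 BPE tokens
    model_type="bpe",
    max_sentencepiece_length=5,           # max subword length 5
    character_coverage=1.0,               # keep Romanian diacritics
)
```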
Data Augmentation
- Noise: MUSAN (6 h Freesound subset), SNR 10–30, prob. 0.2
- Speed perturbation: 0.9–1.1, prob. 0.4
- SpecAugment + SpecCutout
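A sketch of the corresponding `train_ds.augmentor` block, written here as a Python/OmegaConf fragment rather than YAML; the key names follow NeMo's audio-augmentation convention and the MUSAN manifest path is a placeholder, so verify them against the actual training config (SpecAugment and SpecCutout are configured separately under `model.spec_augment`):

```python
# Sketch of the waveform augmentation settings listed above, expressed as a
# Python dict in NeMo's augmentor format (key names are assumptions to be
# checked against the NeMo version in use).
from omegaconf import OmegaConf

augmentor = OmegaConf.create({
    "noise": {
        "manifest_path": "musan_freesound_manifest.json",  # placeholder path
        "prob": 0.2,
        "min_snr_db": 10,
        "max_snr_db": 30,
    },
    "speed": {
        "prob": 0.4,
        "sr": 16000,
        "resample_type": "kaiser_best",
        "min_speed_rate": 0.9,
        "max_speed_rate": 1.1,
    },
})
```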
Training Details
This script was used for training, together with the base training configuration.
- Optimizer: AdamW (lr=2.0, weight_decay=1e-3)
- Scheduler: Noam Annealing with 10k warmup steps
- CTC loss weight: 0.3
- Epochs: 30
- Batch size: 32 (grad accumulation 8)
- Precision: BFloat16
- Hardware: NVIDIA RTX 4090 24GB
- Epoch time: ~5.5 hours
- Final model: average of the 10 checkpoints with the best validation WER
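NeMo ships a checkpoint-averaging utility for this last step; a framework-agnostic sketch of the same idea in plain PyTorch is shown below (checkpoint paths are placeholders):

```python
# Sketch: average the weights of the top-10 checkpoints (selected by
# validation WER) into a single state dict. Paths are placeholders.
import torch

ckpt_paths = [f"checkpoints/top_{i}.ckpt" for i in range(10)]

avg_state = None
for path in ckpt_paths:
    state = torch.load(path, map_location="cpu")["state_dict"]
    if avg_state is None:
        avg_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg_state[k] += v.float()

avg_state = {k: v / len(ckpt_paths) for k, v in avg_state.items()}
torch.save({"state_dict": avg_state}, "checkpoints/averaged.ckpt")
```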
Inference
To run your own evaluation, navigate to the `examples/asr` directory of the NVIDIA NeMo repository:
```bash
cd examples/asr
python3 speech_to_text_eval.py \
    dataset_manifest=../../manifests/SSC-eval1_manifest.json \
    model_path=... \
    output_filename=... \
    decoder_type=ctc \
    ctc_decoding.strategy=beam \
    ctc_decoding.beam.kenlm_path=... \
    ctc_decoding.beam.beam_alpha=... \
    ctc_decoding.beam.beam_beta=... \
    ctc_decoding.beam.beam_size=...
```
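`dataset_manifest` points to a NeMo-style JSON-lines manifest; a short sketch of building one (audio paths, durations, and transcripts below are placeholders):

```python
# Sketch: write a NeMo-style JSON-lines evaluation manifest. Each line holds
# the audio path, its duration in seconds, and the reference text.
import json
import soundfile as sf

samples = [("audio/utt_0001.wav", "transcrierea de referință")]  # placeholders

with open("SSC-eval1_manifest.json", "w", encoding="utf-8") as f:
    for audio_path, text in samples:
        duration = sf.info(audio_path).duration
        entry = {"audio_filepath": audio_path, "duration": duration, "text": text}
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```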
Results
The model can be used with two decoders (TDT and CTC) and several decoding strategies. To use it with the external N-gram language model, check this model card. The beam parameters used with the 6-gram token model are: beam_size=32, beam_alpha=0.9, beam_beta=2.
| Architecture | Decoding | RSC-eval | SSC-eval1 | SSC-eval2 | CDEP-eval | CV-21 | Fleurs-RO | USPDATRO | RTFx |
|---|---|---|---|---|---|---|---|---|---|
| Parakeet Ro 110M TDT (ours) | Greedy | 2.16 | 9.08 | 10.85 | 4.20 | 3.57 | 10.61 | 24.08 | 126.15 |
| Parakeet Ro 110M TDT (ours) | ALSD | 2.05 | 8.64 | 10.88 | 4.17 | 3.38 | 10.16 | 24.30 | 66.63 |
| Parakeet Ro 110M CTC (ours) | Greedy | 2.57 | 10.10 | 12.65 | 4.80 | 4.20 | 11.85 | 27.80 | 130.55 |
| Parakeet Ro 110M CTC (ours) | Beam Token N-gram | 1.73 | 8.12 | 10.75 | 3.92 | 3.29 | 8.85 | 23.40 | 109.46 |
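Word error rates can be recomputed from the manifest written by `speech_to_text_eval.py`; a sketch using `jiwer` (the output path is a placeholder, and the `pred_text` field name follows NeMo's usual manifest convention):

```python
# Sketch: recompute WER from the prediction manifest produced by the
# evaluation script (field names follow NeMo's manifest convention).
import json
import jiwer

refs, hyps = [], []
with open("predictions_manifest.json", encoding="utf-8") as f:
    for line in f:
        entry = json.loads(line)
        refs.append(entry["text"])
        hyps.append(entry["pred_text"])

print(f"WER: {100 * jiwer.wer(refs, hyps):.2f}%")
```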
Citation
If you use this model, please cite:
....
Also consider citing the original NVIDIA NeMo framework and KenLM:
```bibtex
@article{kuchaiev2019nemo,
  title   = {NeMo: a toolkit for building AI applications using Neural Modules},
  author  = {Kuchaiev, Oleksii and Ginsburg, Boris and others},
  journal = {arXiv preprint arXiv:1909.09577},
  year    = {2019}
}

@inproceedings{heafield-2011-kenlm,
  title     = {{K}en{LM}: Faster and Smaller Language Model Queries},
  author    = {Heafield, Kenneth},
  editor    = {Callison-Burch, Chris and Koehn, Philipp and Monz, Christof and Zaidan, Omar F.},
  booktitle = {Proceedings of the Sixth Workshop on Statistical Machine Translation},
  month     = jul,
  year      = {2011},
  address   = {Edinburgh, Scotland},
  publisher = {Association for Computational Linguistics},
  url       = {https://aclanthology.org/W11-2123/},
  pages     = {187--197}
}
```
Contact
For questions or collaborations: [email protected]
License: Apache-2.0