
SpeD_ParakeetRo_110M_TDT-CTC is a Romanian Automatic Speech Recognition (ASR) model based on the FastConformer Hybrid TDT-CTC 110M architecture from NVIDIA NeMo.
The model is adapted from an English pre-trained checkpoint to Romanian through transfer learning, leveraging both speech and text data to achieve strong performance on a low-resource language.


🧠 Model Architecture

  • Base model: Parakeet Hybrid TDT-CTC 110M (NVIDIA NeMo)
  • Pre-training: Self-Supervised Learning (SSL) on LibriLight + supervised fine-tuning on 36k hours of English
  • Tokenizer: SentencePiece, 1024 BPE tokens (max subword length 5)
  • Romanian alphabet: 31 characters + hyphen ("-")

🗣️ Romanian Adaptation

  • Training speech: 2,636 hours of Romanian speech (manually and automatically annotated)
  • Text corpus: 24.6M cleaned Romanian sentences (news domain + speech transcriptions)
  • Tokenizer: Rebuilt on Romanian text corpus using SentencePiece
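
For illustration, a tokenizer with these settings can be built with the SentencePiece Python API. This is a minimal sketch, not the exact command used for this model; file names are placeholders.

# Illustrative sketch: train a Romanian BPE tokenizer with the settings
# described above (1024 tokens, max subword length 5). File names are placeholders.
import sentencepiece as spm

spm.SentencePieceTrainer.train(
    input="romanian_corpus.txt",             # one cleaned sentence per line (placeholder)
    model_prefix="tokenizer_spe_bpe_v1024",
    vocab_size=1024,
    model_type="bpe",
    max_sentencepiece_length=5,
    character_coverage=1.0,                  # keep all Romanian diacritics
)
# The resulting .model/.vocab files can then be supplied to a NeMo ASR config
# as a SentencePiece ("spe") tokenizer.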

🧪 Data Augmentation

  • Noise: MUSAN (6-hour Freesound subset), SNR 10–30 dB, applied with probability 0.2
  • Speed perturbation: rate 0.9–1.1, applied with probability 0.4 (both waveform-level augmentations are sketched after this list)
  • SpecAugment + SpecCutout
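
The noise and speed settings above correspond roughly to the following waveform-level operations. This is an illustrative NumPy sketch with placeholder signals, not NeMo's actual augmentor implementation.

# Sketch of the two waveform-level augmentations described above (illustrative only).
import numpy as np

def add_noise(speech, noise, snr_db):
    """Mix noise into speech at the requested signal-to-noise ratio (dB)."""
    noise = np.resize(noise, speech.shape)
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    scale = np.sqrt(speech_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return speech + scale * noise

def augment(speech, noise, rng=None):
    if rng is None:
        rng = np.random.default_rng()
    # MUSAN noise at an SNR drawn from 10-30 dB, applied with probability 0.2
    if rng.random() < 0.2:
        speech = add_noise(speech, noise, snr_db=rng.uniform(10, 30))
    # Speed perturbation with a rate drawn from 0.9-1.1, applied with probability 0.4
    if rng.random() < 0.4:
        rate = rng.uniform(0.9, 1.1)
        positions = np.arange(0, len(speech), rate)   # naive resampling, for illustration only
        speech = np.interp(positions, np.arange(len(speech)), speech)
    return speech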

⚙️ Training Details

The model was trained with the standard NeMo training script and the base Hybrid TDT-CTC training configuration, using the settings below.

  • Optimizer: AdamW (lr=2.0, weight_decay=1e-3)
  • Scheduler: Noam Annealing with 10k warmup steps
  • CTC loss weight: 0.3
  • Epochs: 30
  • Batch size: 32 (grad accumulation 8)
  • Precision: BFloat16
  • Hardware: NVIDIA RTX 4090 24GB
  • Epoch time: ~5.5 hours
  • Final model: Checkpoint averaging over the 10 checkpoints with the best validation WER (sketched below)
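
The checkpoint-averaging step can be done with NeMo's tooling or, conceptually, with plain PyTorch as in the sketch below; the checkpoint paths are hypothetical.

# Sketch of checkpoint averaging: average the weights of the 10 checkpoints
# with the best validation WER. Paths are placeholders.
import torch

ckpt_paths = [f"checkpoints/best_wer_{i}.ckpt" for i in range(10)]   # hypothetical paths

avg_state = None
for path in ckpt_paths:
    state = torch.load(path, map_location="cpu")["state_dict"]
    if avg_state is None:
        avg_state = {k: v.clone().float() for k, v in state.items()}
    else:
        for k, v in state.items():
            avg_state[k] += v.float()

avg_state = {k: v / len(ckpt_paths) for k, v in avg_state.items()}
torch.save({"state_dict": avg_state}, "checkpoints/averaged.ckpt")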

Inference

To run your own experiment, navigate to the examples/asr directory of the NeMo repository and run the evaluation script:

cd examples/asr

python3 speech_to_text_eval.py \
dataset_manifest=../../manifests/SSC-eval1_manifest.json \
model_path=... \
output_filename=... \
decoder_type=ctc \
ctc_decoding.strategy=beam \
ctc_decoding.beam.kenlm_path=... \
ctc_decoding.beam.beam_alpha=... \
ctc_decoding.beam.beam_beta=... \
ctc_decoding.beam.beam_size=...
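
For quick transcription from Python, something along these lines should also work with the NeMo ASR API; the .nemo checkpoint name and audio path are placeholders.

# Transcription sketch using the NeMo ASR API (placeholder file names).
import nemo.collections.asr as nemo_asr

model = nemo_asr.models.ASRModel.restore_from("SpeD_ParakeetRo_110M_TDT-CTC.nemo")
model.eval()

# By default the hybrid model decodes with its TDT (transducer) head.
# transcribe() takes a list of (typically 16 kHz, mono) audio files.
transcriptions = model.transcribe(["sample_ro.wav"])
print(transcriptions[0])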

Results

The model can be used with two decoders (TDT and CTC) and several decoding strategies. To use it with an external N-gram language model, see the companion N-gram model card. The beam-search parameters used with the 6-gram token-level model are beam_size=32, beam_alpha=0.9, beam_beta=2.
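
As an illustration, the same decoder and beam settings can be applied from Python (reusing the model loaded in the inference sketch above) via NeMo's change_decoding_strategy; the field names mirror the CLI overrides shown earlier, the KenLM binary path is a placeholder, and beam search with KenLM requires NeMo's optional beam-search dependencies.

# Sketch: switch the hybrid model to CTC beam-search decoding with a
# token-level N-gram (KenLM) model, mirroring the CLI overrides above.
from omegaconf import OmegaConf

ctc_decoding = OmegaConf.create({
    "strategy": "beam",
    "beam": {
        "beam_size": 32,
        "beam_alpha": 0.9,
        "beam_beta": 2.0,
        "kenlm_path": "6gram_tokens.bin",   # placeholder path to the KenLM binary
    },
})
model.change_decoding_strategy(ctc_decoding, decoder_type="ctc")
print(model.transcribe(["sample_ro.wav"])[0])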

All decoding results are word error rates (WER, %); RTFx is the inverse real-time factor (higher is faster).

| Architecture | Decoding | RSC-eval | SSC-eval1 | SSC-eval2 | CDEP-eval | CV-21 | Fleurs-RO | USPDATRO | RTFx |
|---|---|---|---|---|---|---|---|---|---|
| Parakeet Ro 110M TDT (ours) | Greedy | 2.16 | 9.08 | 10.85 | 4.20 | 3.57 | 10.61 | 24.08 | 126.15 |
| Parakeet Ro 110M TDT (ours) | ALSD | 2.05 | 8.64 | 10.88 | 4.17 | 3.38 | 10.16 | 24.30 | 66.63 |
| Parakeet Ro 110M CTC (ours) | Greedy | 2.57 | 10.10 | 12.65 | 4.80 | 4.20 | 11.85 | 27.80 | 130.55 |
| Parakeet Ro 110M CTC (ours) | Beam + token N-gram | 1.73 | 8.12 | 10.75 | 3.92 | 3.29 | 8.85 | 23.40 | 109.46 |

📄 Citation

If you use this model, please cite:

....

Also consider citing the original NVIDIA NeMo framework and KenLM:

@article{kuchaiev2019nemo,
  title={NeMo: a toolkit for building AI applications using Neural Modules},
  author={Kuchaiev, Oleksii and Ginsburg, Boris and others},
  journal={arXiv preprint arXiv:1909.09577},
  year={2019}
}

@inproceedings{heafield-2011-kenlm,
    title = "{K}en{LM}: Faster and Smaller Language Model Queries",
    author = "Heafield, Kenneth",
    editor = "Callison-Burch, Chris  and
      Koehn, Philipp  and
      Monz, Christof  and
      Zaidan, Omar F.",
    booktitle = "Proceedings of the Sixth Workshop on Statistical Machine Translation",
    month = jul,
    year = "2011",
    address = "Edinburgh, Scotland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/W11-2123/",
    pages = "187--197"
}

Contact

For questions or collaborations: [email protected]

License

This model is released under the Apache 2.0 license.
