Model Card for speech2text-intensity-regression
This repository provides a multi-task Whisper-based model that performs automatic speech recognition (ASR) and voice intensity (loudness) regression in a single forward pass. A lightweight regression head is attached to Whisper's encoder to predict loudness in dBFS (RMS) or LUFS (per ITU-R BS.1770).
Model Details
Model Description
- Developed by: Amirhossein Yousefi (GitHub: @amirhossein-yousefi)
- Shared by: Amirhossein Yousefi
- Model type: Whisper encoder–decoder (ASR) with an additional regression head on the encoder for loudness prediction
- Language(s) (NLP): English by default (LibriSpeech). Multilingual is supported if trained on Common Voice with the appropriate `--language` code.
- License: MIT
- Finetuned from model: `openai/whisper-small` (other Whisper sizes can be used via the `--model_id` argument).
What's in the repo
- End-to-end training and evaluation scripts (WER + intensity RMSE)
- A simple baseline intensity regressor for comparison
- A Gradio demo app for local inference
- Dockerfile and Amazon SageMaker training/inference helpers
Model Sources
- Repository: https://github.com/amirhossein-yousefi/speech2text-intensity-regression
- Demo: Local Gradio app (`app/app.py`)
- Sample checkpoint: See the link in the repository README
Uses
Direct Use
- Transcribe short-form or long-form speech while simultaneously estimating voice loudness (RMS (dBFS) or LUFS) for analytics, QA, or normalization workflows.
- Monitor audio level trends alongside transcript quality in call analytics, content moderation pipelines, or dataset curation.
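For illustration only, the sketch below shows the shape of such a workflow using off-the-shelf pieces: the stock `openai/whisper-small` pipeline for transcription and `pyloudnorm` to measure LUFS from the waveform. The multi-task checkpoint in this repository instead predicts loudness from the encoder in the same forward pass as the transcript; see `app/app.py` for the actual demo.

```python
# Illustrative workflow sketch only: stock Whisper ASR + measured LUFS.
# The repository's checkpoint predicts loudness directly from the encoder
# instead of measuring it; see app/app.py for the real demo.
import soundfile as sf
import pyloudnorm as pyln
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

audio, sr = sf.read("example.wav")       # mono float waveform + sample rate
transcript = asr("example.wav")["text"]  # transcription via stock Whisper

meter = pyln.Meter(sr)                   # ITU-R BS.1770 meter
lufs = meter.integrated_loudness(audio)  # measured loudness in LUFS

print(f"{transcript!r} @ {lufs:.1f} LUFS")
```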
Downstream Use
- Fine-tune on domain- or language-specific data (e.g., Common Voice) to adapt both transcription and loudness estimation.
- Integrate the model's loudness head into larger prosody or audio-quality monitoring systems.
Out-of-Scope Use
- Emotion/affect inference: Loudness is not a proxy for emotional intensity or arousal without appropriate labels and calibration.
- Legal/compliance metering: LUFS/dBFS estimates depend on microphone gain, distance, codec, and environment; do not use as a calibrated sound level meter.
- Speaker health/medical conclusions: Not designed or validated for clinical use.
Bias, Risks, and Limitations
- ASR robustness can degrade for accents, noisy conditions, reverberant rooms, or domains far from training data.
- Loudness predictions are input-chain dependent (mic gain, compression, codecs) and may not be comparable across devices without conditioning or calibration.
- LUFS vs dBFS: LUFS better correlates with perceived loudness but depends on implementation details; dBFS (RMS) is simpler but less perceptual.
Recommendations
- Calibrate and/or condition on known recording chains when comparing intensity across sessions or devices.
- Prefer LUFS targets (`--intensity_method lufs`) for perceptual alignment; use RMS dBFS for simpler, robust estimates.
- Evaluate on in-domain audio (compute WER and intensity RMSE) before deployment; consider domain adaptation via fine-tuning.
How to Get Started with the Model
Install (Python 3.10+; ensure a matching PyTorch+CUDA wheel if using GPU):
```bash
git clone https://github.com/amirhossein-yousefi/speech2text-intensity-regression
cd speech2text-intensity-regression
pip install -r requirements.txt
```
Train (example: LibriSpeech clean-100, Whisper-small):
```bash
python src/train_multitask_whisper.py \
  --model_id openai/whisper-small \
  --dataset librispeech \
  --librispeech_config clean \
  --train_split train.100 \
  --eval_split validation \
  --test_split test \
  --language en \
  --intensity_method rms \
  --epochs 3 \
  --batch_size 8 \
  --grad_accum 2 \
  --lr 1e-5 \
  --fp16 \
  --output_dir ./checkpoints/mtl_whisper_small
```
Evaluate on test:
```bash
python src/evaluate.py --ckpt ./checkpoints/mtl_whisper_small --dataset librispeech --language en --intensity_method rms
```
Run the local demo app:
```bash
CHECKPOINT=./checkpoints/mtl_whisper_small python app/app.py
# Open the printed Gradio URL; upload a .wav/.flac to see the transcript + intensity
```
CLI baseline intensity regressor:
```bash
python src/baseline/baseline_intensity_regressor.py --dataset librispeech --language en --intensity rms
```
Training Details
Training Data
- LibriSpeech via Hugging Face Datasets: `openslr/librispeech_asr` (use the `clean` config; `train.100`, `validation`, `test` splits). Intensity targets are computed directly from the audio (RMS dBFS or LUFS).
- Common Voice 11.0 via Hugging Face Datasets: `mozilla-foundation/common_voice_11_0` (set `--language`, e.g., `en`, `hi`).
Note: For human-annotated arousal/intensity, you may adapt the code to datasets like MSP-Podcast or CREMA-D (ensure licensing).
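For reference, the LibriSpeech configuration named above can be loaded and resampled as follows. This is a minimal sketch; the actual loading and column handling live in `src/train_multitask_whisper.py`.

```python
# Minimal sketch: load the LibriSpeech "clean" config and resample to 16 kHz.
# Actual loading/column handling lives in src/train_multitask_whisper.py.
from datasets import load_dataset, Audio

ds = load_dataset("openslr/librispeech_asr", "clean", split="train.100")
ds = ds.cast_column("audio", Audio(sampling_rate=16_000))  # Whisper expects 16 kHz

sample = ds[0]
print(sample["text"])                  # reference transcript
print(sample["audio"]["array"].shape)  # waveform as a NumPy array
```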
Training Procedure
Preprocessing
- Audio is resampled to 16 kHz, as required by the Whisper feature extractor.
- Intensity is computed per clip as RMS (dBFS) or LUFS (via `pyloudnorm`).
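A minimal sketch of the two target definitions (illustrative only; the training script computes the targets internally):

```python
# Sketch of the two loudness targets described above (illustrative; the
# training script computes these internally).
import numpy as np
import pyloudnorm as pyln

def rms_dbfs(x: np.ndarray, eps: float = 1e-12) -> float:
    """RMS level of a float waveform in [-1, 1], expressed in dBFS."""
    rms = np.sqrt(np.mean(np.square(x)) + eps)
    return 20.0 * np.log10(rms)

def lufs(x: np.ndarray, sr: int = 16_000) -> float:
    """Integrated loudness per ITU-R BS.1770, via pyloudnorm."""
    return pyln.Meter(sr).integrated_loudness(x)
```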
Objective
A small MLP regression head is attached to the mean-pooled encoder last hidden state. Training minimizes:
```
total_loss = asr_ce_loss + λ * mse(intensity)
```
λ is controlled by `--lambda_intensity` (default 1.0).
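A schematic of this objective with dummy tensors (names here are illustrative and do not mirror the repository's actual modules):

```python
# Schematic of the multi-task objective with dummy tensors; names are
# illustrative and do not mirror the repository's actual modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

batch, frames, d_model = 2, 1500, 768                  # whisper-small encoder width
encoder_hidden = torch.randn(batch, frames, d_model)   # encoder last hidden state
target_intensity = torch.tensor([-23.0, -18.5])        # loudness targets (dBFS/LUFS)
asr_ce_loss = torch.tensor(1.7)                        # stand-in for Whisper's CE loss
lambda_intensity = 1.0                                 # --lambda_intensity (default 1.0)

regression_head = nn.Sequential(                       # small MLP on pooled encoder states
    nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1)
)

pooled = encoder_hidden.mean(dim=1)                    # mean-pool over time
pred_intensity = regression_head(pooled).squeeze(-1)   # (batch,)

total_loss = asr_ce_loss + lambda_intensity * F.mse_loss(pred_intensity, target_intensity)
```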
Training Hyperparameters
- Example: `epochs=3`, `batch_size=8`, `grad_accum=2`, `lr=1e-5`, `fp16=True` (see the README for more).
Speeds, Sizes, Times
- Base ASR backbone (example): `openai/whisper-small` (~244M parameters). Training time depends on hardware and dataset size.
Evaluation
Testing Data, Factors & Metrics
- Testing Data: LibriSpeech test split or your in-domain test set
- Factors: Noise conditions, microphones, languages, codecs
- Metrics:
  - ASR: WER (via `jiwer`)
  - Intensity: RMSE in dBFS or LUFS
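Both metrics can be reproduced with a few lines; a minimal sketch with dummy values (`src/evaluate.py` computes them on real data):

```python
# Sketch of the two metrics with dummy values; src/evaluate.py computes them
# on real predictions.
import numpy as np
import jiwer

wer = jiwer.wer(
    ["the quick brown fox"],     # reference transcripts
    ["the quick browns fox"],    # model hypotheses
)

pred = np.array([-22.1, -18.7])  # predicted loudness (dBFS or LUFS)
true = np.array([-23.0, -18.5])  # targets computed from the audio
intensity_rmse = float(np.sqrt(np.mean((pred - true) ** 2)))

print(f"WER={wer:.4f}  intensity RMSE={intensity_rmse:.4f}")
```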
Results & Metrics
Highlights
- Test WER (↓): 4.6976
- Test Intensity RMSE (↓): 0.7334
- Validation WER (↓): 4.6973 • Validation Intensity RMSE (↓): 1.4492

Lower is better (↓). WER computed with `jiwer`. Intensity RMSE is the regression error on the loudness target (RMS dBFS by default, or LUFS if `--intensity_method lufs` is used).
Full Metrics
Validation (Dev)
| Metric | Value |
|---|---|
| Loss | 2.2288 |
| WER (↓) | 4.6973 |
| Intensity RMSE (↓) | 1.4492 |
| Runtime (s) | 1,156.757 (≈ 19 min 17 s) |
| Samples / s | 2.337 |
| Steps / s | 0.292 |
| Epoch | 1 |
Test
| Metric | Value |
|---|---|
| Loss | 0.6631 |
| WER (↓) | 4.6976 |
| Intensity RMSE (↓) | 0.7334 |
| Runtime (s) | 1,129.272 (≈ 18 min 49 s) |
| Samples / s | 2.320 |
| Steps / s | 0.290 |
| Epoch | 1 |
Training Summary
| Metric | Value |
|---|---|
| Train Loss | 72.5232 |
| Runtime (s) | 6,115.966 (≈ 1 h 41 min 56 s) |
| Samples / s | 4.666 |
| Steps / s | 0.292 |
| Epochs | 1 |
Raw metrics (for reproducibility)
```json
{
"validation": {
"eval_loss": 2.228771209716797,
"eval_wer": 4.69732730414323,
"eval_intensity_rmse": 1.4492216110229492,
"eval_runtime": 1156.7567,
"eval_samples_per_second": 2.337,
"eval_steps_per_second": 0.292,
"epoch": 1.0
},
"training": {
"train_loss": 72.52319664163974,
"train_runtime": 6115.9656,
"train_samples_per_second": 4.666,
"train_steps_per_second": 0.292,
"epoch": 1.0
},
"test": {
"test_loss": 0.6630592346191406,
"test_wer": 4.69758064516129,
"test_intensity_rmse": 0.7333692312240601,
"test_runtime": 1129.2724,
"test_samples_per_second": 2.32,
"test_steps_per_second": 0.29,
"epoch": 1.0
}
}
```
Results
- Example logs and a sample checkpoint are referenced in the repository (`training-test-logs/` and the README link). Reproduce the numbers with the provided scripts for your environment.
Model Examination
- Inspect encoder activations or the regression head behavior across amplitude-normalized vs. unnormalized inputs to understand sensitivity to recording chain variations.
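As a starting point, the probe below (a minimal, self-contained sketch with synthetic audio) applies a known gain and checks how the ground-truth loudness target shifts; the model's predicted intensity for the same pair of clips can then be compared against that shift.

```python
# Minimal probe sketch with synthetic audio: the ground-truth RMS target
# shifts exactly with the applied gain; compare the model's predictions for
# the same clip pair to gauge sensitivity to the recording chain.
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-0.1, 0.1, 16_000).astype(np.float32)  # 1 s stand-in clip
gain_db = 12.0
y = (x * 10 ** (gain_db / 20)).astype(np.float32)      # amplified copy

def rms_dbfs(w: np.ndarray) -> float:
    return 20.0 * np.log10(np.sqrt(np.mean(w ** 2)) + 1e-12)

print(rms_dbfs(y) - rms_dbfs(x))  # ≈ +12 dB: the target tracks gain exactly
```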
Environmental Impact
Carbon emissions can be estimated using the Machine Learning Impact calculator.
- Hardware Type: NVIDIA GeForce RTX 3080 Ti Laptop GPU (16 GB VRAM)
- Hours used: Not reported (varies by user setup and dataset size)
- Cloud Provider: N/A for local training; AWS SageMaker supported for cloud
Technical Specifications
Model Architecture and Objective
- Whisper encoder–decoder (transformer) for ASR with an additional regression head on top of the mean-pooled encoder representation. The objective is ASR CE loss + λ·MSE for intensity.
Compute Infrastructure
Hardware
- Validated on a single laptop GPU (RTX 3080 Ti Laptop). SageMaker training scripts included for cloud training.
Software
- Python, PyTorch, Hugging Face Transformers/Datasets, `jiwer`, `pyloudnorm`, Gradio, and (optional) Amazon SageMaker.
Citation
If you use this repository, please consider citing the underlying datasets and Whisper model.
BibTeX (Whisper):
```bibtex
@article{radford2022whisper,
  title={Robust Speech Recognition via Large-Scale Weak Supervision},
  author={Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal={arXiv preprint arXiv:2212.04356},
  year={2022}
}
```
BibTeX (LibriSpeech):
```bibtex
@inproceedings{panayotov2015librispeech,
  title={Librispeech: An {ASR} corpus based on public domain audio books},
  author={Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev},
  booktitle={2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  pages={5206--5210},
  year={2015},
  organization={IEEE}
}
```
BibTeX (Common Voice):
```bibtex
@inproceedings{ardila2020common,
  title={Common Voice: A Massively-Multilingual Speech Corpus},
  author={Ardila, Rosana and Branson, Megan and Davis, Kelly and Henretty, Michael and Kohler, Michael and Meyer, Josh and Morais, Reuben and Saunders, Lindsay and Tyers, Francis M. and Weber, Gregor},
  booktitle={Proceedings of The 12th Language Resources and Evaluation Conference},
  pages={4218--4222},
  year={2020}
}
```
Glossary
- ASR: Automatic Speech Recognition
- WER: Word Error Rate
- dBFS: Decibels relative to full scale (digital amplitude)
- LUFS: Loudness Units relative to Full Scale (per ITU-R BS.1770)
- Regression head: Small MLP predicting continuous loudness target
More Information
- For deployment, see `sagemaker/inference/` and `sagemaker/train/` for AWS SageMaker examples.
- For local testing and UI, see `app/app.py` (Gradio).
Model Card Authors
- Amirhossein Yousefi and contributors
Model Card Contact
- GitHub Issues on the repository