NB/WARNING: I am aware of the inference challenges and will find time to work on this. In the meantime, please check this alternative model: adoamesh/whisper-small-swh-finetuned. Note that it is NOT quantized.

On Windows, `_empty_affine_quantized` is not implemented in PyTorch, so loading this quantized model fails immediately. Use Linux or another UNIX-like OS instead.
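A minimal guard you could add before loading, to fail fast with a clearer message (the check itself is illustrative, not part of this repo):

```python
import platform

# Dynamic int8 ops (_empty_affine_quantized) are missing on Windows
# builds of PyTorch, so bail out early with a clear message.
if platform.system() == "Windows":
    raise RuntimeError(
        "This int8-quantized Whisper model requires Linux/UNIX: "
        "_empty_affine_quantized is not implemented on Windows."
    )
```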

Model Card for Whisper Tiny Swahili Distilled 8-bit (CPU Support)

Model Description

This model is a quantized version of a distilled Whisper Tiny model specifically trained for Swahili speech recognition. The model has been compressed using dynamic quantization (int8) to reduce its size while largely preserving performance.

  • Model Type: Automatic Speech Recognition (ASR)
  • Language: Swahili
  • Base Model: OpenAI Whisper Tiny
  • Quantization: Dynamic quantization (int8); runs on CPU only
  • Distillation: Knowledge distillation from Whisper Small

Intended Uses & Limitations

Intended Uses

  • Transcription of Swahili audio to text
  • Speech-to-text applications for Swahili language
  • Research in low-resource language speech recognition

Limitations

  • May perform poorly on dialects or accents not well-represented in training data
  • May struggle with technical terminology or domain-specific vocabulary
  • Performance may degrade with noisy audio recordings
  • Quantization may introduce slight degradation in transcription quality compared to the full-precision model

Training Data

The model was trained on Swahili speech data from the Fleurs-SLU dataset (see citations below):

  • Training set: 3.62 hours of Swahili audio
  • Test set: 0.60 hours of Swahili audio
  • Total dataset: 4.22 hours of Swahili audio

Training Procedure

Knowledge Distillation

The model was trained via knowledge distillation, with Whisper Small as the teacher and Whisper Tiny as the student. The distillation loss, sketched below the list, combines:

  • Soft targets from the teacher model (temperature-scaled logits)
  • Hard targets from the ground truth transcriptions
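A minimal sketch of such a combined objective, following the standard Hinton-style formulation; `temperature` and `alpha` here are illustrative defaults, not the actual training hyperparameters:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=2.0, alpha=0.5):
    # Soft-target term: KL divergence between the temperature-scaled
    # teacher and student distributions, scaled by T^2 (Hinton et al.)
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2

    # Hard-target term: cross-entropy against the ground-truth tokens;
    # -100 is the padding convention used in Transformers label tensors
    hard = F.cross_entropy(
        student_logits.view(-1, student_logits.size(-1)),
        labels.view(-1),
        ignore_index=-100,
    )
    return alpha * soft + (1 - alpha) * hard
```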

Quantization

After distillation, the model was quantized using PyTorch's dynamic quantization (see the sketch after this list):

  • Linear layers were quantized to int8
  • Quantization was applied post-training (PTQ)
  • Model size was reduced while maintaining most of the performance
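This corresponds to PyTorch's `torch.quantization.quantize_dynamic` API. A minimal sketch, assuming the distilled float32 checkpoint is saved locally (the path is illustrative):

```python
import torch
from transformers import WhisperForConditionalGeneration

# Load the distilled float32 student (path is illustrative)
model = WhisperForConditionalGeneration.from_pretrained(
    "./whisper-tiny-swahili-distilled"
)

# Post-training dynamic quantization: nn.Linear weights are stored
# as int8, activations are quantized on the fly at inference time.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
```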

Evaluation Results

Distilled Model Performance

The distilled model was evaluated every 250 steps over 1000 training steps, with the following metrics:

| Step | Training Loss | Validation Loss | WER (%) |
|------|---------------|-----------------|---------|
| 250  | 0.120300      | 0.711645        | 37.0963 |
| 500  | 0.014100      | 0.770001        | 33.0668 |
| 750  | 0.001000      | 0.792062        | 32.2354 |
| 1000 | 0.000700      | 0.803786        | 32.0754 |


Comparison with Original Model

| Model | WER (%) | Improvement |
|-------|---------|-------------|
| Original Whisper-Small (OpenAI)    | 103.10 | – |
| Fine-tuned Whisper-Small (Swahili) | 32.07  | 71.03-point drop; (103.10 − 32.07) / 103.10 ≈ 68.9% relative reduction |

Raw evaluation output at step 1000:

```
{'eval_loss': 0.8037856817245483, 'eval_wer': 32.075471698113205, 'eval_runtime': 74.3925, 'eval_samples_per_second': 2.07, 'eval_steps_per_second': 0.269, 'epoch': 19.607843137254903}
```

Training Statistics

  • Training runtime: 8323.69 seconds
  • Training samples per second: 3.844
  • Training steps per second: 0.24
  • Total FLOPs: 9.17e+18
  • Epochs: 39.22

Evaluation Statistics

  • Evaluation loss: 0.8844
  • Evaluation WER: 32.1394% (see the computation sketch after this list)
  • Evaluation runtime: 73.19 seconds
  • Evaluation samples per second: 2.104
  • Evaluation steps per second: 0.273
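For reference, WER figures like the one above can be computed with the Hugging Face `evaluate` library. A minimal sketch with placeholder strings (not the actual evaluation data):

```python
import evaluate

# Load the word-error-rate metric
wer_metric = evaluate.load("wer")

# Placeholder strings purely for illustration
predictions = ["hypothetical model transcription"]
references = ["hypothetical reference transcription"]

# `compute` returns a fraction; the card reports WER as a percentage
wer_percent = 100 * wer_metric.compute(predictions=predictions, references=references)
print(f"WER: {wer_percent:.2f}%")
```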

Comparison on an Example Transcription

Swahili Whisper Speech-to-Text Transcription

Code Implementation

```python
import torch
import torchaudio
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# --- 1️⃣ Load processor (shared between models) ---
processor = WhisperProcessor.from_pretrained("openai/whisper-small")

# --- 2️⃣ Load models ---
# Fine-tuned Swahili Whisper-Small
finetuned_model_path = "./Ex02/whisper-small-swh/checkpoint-1000"
finetuned_model = WhisperForConditionalGeneration.from_pretrained(finetuned_model_path)
finetuned_model.generation_config.language = "swahili"
finetuned_model.generation_config.task = "transcribe"
finetuned_model.eval()

# Original OpenAI Whisper-Small
original_model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")
original_model.generation_config.language = "swahili"
original_model.generation_config.task = "transcribe"
original_model.eval()

# --- 3️⃣ Load audio file ---
audio_path = "./Ex02/Recording.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample if not 16kHz
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(sample_rate, 16000)
    waveform = resampler(waveform)

# Prepare input features
input_features = processor(
    waveform.numpy()[0],
    sampling_rate=16000,
    return_tensors="pt"
).input_features

# --- 4️⃣ Transcribe ---
with torch.no_grad():
    # Fine-tuned
    predicted_ids_finetuned = finetuned_model.generate(input_features)
    transcription_finetuned = processor.batch_decode(predicted_ids_finetuned, skip_special_tokens=True)[0]

    # Original
    predicted_ids_original = original_model.generate(input_features)
    transcription_original = processor.batch_decode(predicted_ids_original, skip_special_tokens=True)[0]

# --- 5️⃣ Print results ---
print("\n=== Transcriptions ===")
print(f"Fine-tuned Swahili Whisper-Small: {transcription_finetuned}")
print(f"Original Whisper-Small (OpenAI): {transcription_original}")


How to Use

```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import torchaudio
from huggingface_hub import login

# Optional: log in if the model is private (i.e., requests fail with HTTP 401)
login(token="hf_somekey")  # replace with your own token

# ============================================================
# 1️⃣ Load Processor and Quantized Model
# ============================================================
MODEL_ID = "adoamesh/whisper-tiny-swahili-distilled-8bit"

processor = WhisperProcessor.from_pretrained(MODEL_ID)
model = WhisperForConditionalGeneration.from_pretrained(MODEL_ID)
model.eval()

# Force CPU execution (dynamic int8 quantization runs on CPU only)
device = torch.device("cpu")
model.to(device)

# ============================================================
# 2️⃣ Load and Preprocess Audio
# ============================================================
audio_path = "Recording.wav"
waveform, sample_rate = torchaudio.load(audio_path)

# Resample to 16 kHz if necessary
if sample_rate != 16000:
    resampler = torchaudio.transforms.Resample(orig_freq=sample_rate, new_freq=16000)
    waveform = resampler(waveform)

# Prepare input features
input_features = processor(
    waveform.squeeze().numpy(),  # remove channel dim if present
    sampling_rate=16000,
    return_tensors="pt"
).input_features.to(device)

# ============================================================
# 3️⃣ Generate Transcription
# ============================================================
with torch.no_grad():
    predicted_ids = model.generate(input_features)

# Decode text
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(f"\n🗣️  Transcription:\n{transcription}\n")
Results Output ranked (Human Eval)

=== Transcriptions ===

  • Fine-tuned Swahili Whisper-Small: Ni limwambia yulebindi kwamba na mpenda. 1️⃣
  • Original Whisper-Small (OpenAI): Nili mwabi aiyule bindi kwa mba nampe dha. 2️⃣

Model Architecture

The model follows the Whisper Tiny architecture (mirrored in the config sketch after this list):

  • Encoder-decoder transformer structure
  • 4 encoder layers and 4 decoder layers
  • Model dimension: 384
  • Feed-forward dimension: 1536
  • 6 attention heads
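These dimensions match the Whisper Tiny defaults in `transformers.WhisperConfig`. A sketch of the corresponding config, built from the values above rather than loaded from the actual checkpoint:

```python
from transformers import WhisperConfig

# Whisper Tiny dimensions as listed above (illustrative; not the
# exact config shipped with this checkpoint)
config = WhisperConfig(
    d_model=384,                # model dimension
    encoder_layers=4,
    decoder_layers=4,
    encoder_attention_heads=6,
    decoder_attention_heads=6,
    encoder_ffn_dim=1536,       # feed-forward dimension
    decoder_ffn_dim=1536,
)
print(config)
```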

Hardware Requirements

  • CPU: Any modern CPU
  • RAM: Minimum 2GB
  • Storage: ~150MB for the model

Ethical Considerations

  • The model should not be used for surveillance or discriminatory purposes
  • Users should be aware of potential biases in the training data
  • The model may not perform equally well across all Swahili dialects

Citation

If you use this model, please cite:

@misc{whisper-tiny-swahili-distilled-8bit,
  author = {Daniel Amemba Odhiambo},
  title = {Whisper Tiny Swahili Distilled 8-bit},
  year = {2025},
  publisher = {Hugging Face},
  journal = {Hugging Face Model Hub},
  howpublished = {\url{https://huggingface.co/adoamesh/whisper-tiny-swahili-distilled-8bit}}
}

Dataset Citations

Fleurs-SLU (Swahili speech data)

@misc{schmidt2025fleursslumassivelymultilingualbenchmark,
      title={Fleurs-SLU: A Massively Multilingual Benchmark for Spoken Language Understanding}, 
      author={Fabian David Schmidt and Ivan Vulić and Goran Glavaš and David Ifeoluwa Adelani},
      year={2025},
      eprint={2501.06117},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2501.06117}, 
}

@misc{adelani2023sib200,
      title={SIB-200: A Simple, Inclusive, and Big Evaluation Dataset for Topic Classification in 200+ Languages and Dialects}, 
      author={David Ifeoluwa Adelani and Hannah Liu and Xiaoyu Shen and Nikita Vassilyev and Jesujoba O. Alabi and Yanke Mao and Haonan Gao and Annie En-Shiun Lee},
      year={2023},
      eprint={2309.07445},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}



Acknowledgments

  • OpenAI for the original Whisper model
  • Fleurs-SLU for the Swahili speech data
  • Hugging Face for the Transformers library and model hosting

Contact Author on LinkedIn

