Whisper Hebrish: Whisper Large (Turbo V3) Fine-Tuned For English-Hebrew Immigrant Speech Patterns ("Hebrish")


Try it out

Try the interactive demo on Hugging Face Spaces

ASR For Mixed Speech Patterns

Many immigrant groups at various stages of absorption into non-English-speaking societies adopt a unique linguistic hybrid: their native tongue peppered with liberal dashes of their second language.

I was born in Ireland and immigrated to Israel 10 years ago. In recent years, I have become a passionate user of AI tools - especially speech-to-text (STT) and automatic speech recognition (ASR).

Over the course of a year spent transcribing everything from grocery lists to blog outlines (mostly using Whisper or variants of it), I noticed an obvious pattern: while Whisper is a superlatively good speech recognition model, most of the Hebrew words that English-speaking immigrants use in daily speech ("teudat zehut" - ID card; "mazgan" - air conditioner) are neither English nor sufficiently well known (contrast: Shabbat, Torah) to appear in the corpora ingested into ASR training sets.

The result: the ASR attempts to transcribe phonetically, with results ranging from the comical to the plainly unintelligible.

I recently created a personal fine-tune of Whisper.

With the notebook code still to hand, I thought it was worth seeing whether I could fine-tune Whisper for this purpose, which relates to one of the most important use cases for ASR fine-tuning: adapting inherently multilingual ASR models to underrepresented languages.

Example

OpenAI Whisper Large V3 Turbo vs. the fine-tune, head to head.

Demo with two words from the dataset: makolet (minimarket) and teudat zehut (ID card):

TRUTH:

I went to the makolet today to pick up some bread, and I also got my teudat zehut.

FINE-TUNE:

I went to the makolet today to pick up some bread, and I also got my teudat zehut.

STOCK WHISPER:

 I went to the Macaulay today to pick up some bread and I also got my Theodette Sahoot.
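A head-to-head comparison like the one above can be run with the `transformers` ASR pipeline. This is a minimal sketch: the model IDs are the stock checkpoint and this repository's fine-tune, while the audio path (`sample.wav`) is purely illustrative.

```python
MODEL_ID = "danielrosehill/Whisper-Hebrish"    # the fine-tune
BASELINE_ID = "openai/whisper-large-v3-turbo"  # the stock base model


def transcribe(audio_path: str, model_id: str) -> str:
    """Transcribe one audio file with the given Whisper checkpoint."""
    # Imported lazily so the module loads even without transformers installed.
    from transformers import pipeline

    # The ASR pipeline handles feature extraction and decoding end to end.
    asr = pipeline("automatic-speech-recognition", model=model_id)
    return asr(audio_path)["text"]


if __name__ == "__main__":
    # Run both checkpoints on the same clip and compare outputs.
    for checkpoint in (BASELINE_ID, MODEL_ID):
        print(f"{checkpoint}: {transcribe('sample.wav', checkpoint)}")
```

Running the script requires `transformers`, `torch`, and an audio backend such as `ffmpeg`; on CPU, Large V3 Turbo transcription is slow but workable for single clips.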

Methodology

I used Claude Code to generate a list of 500 Hebrew words which it believed English speakers may use in daily speech. I recorded a subset of these and added my own as they came to mind.

I recorded three variations of each word in an attempt to buttress the reliability of the fine-tune. Where variations in pronunciation exist for common words, I recorded each variant.

The dataset that this model was trained on preserves the original audio files and the ground truths - the latter in the JSONL.
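For illustration, a JSONL ground-truth file of this kind is one JSON object per line pairing an audio file with its transcript. The field names (`audio`, `text`) and paths below are assumptions for the sketch, not the dataset's documented schema.

```python
import json

# Hypothetical records -- field names and paths are illustrative only.
samples = [
    {"audio": "recordings/makolet_01.wav", "text": "I went to the makolet"},
    {"audio": "recordings/teudat_zehut_01.wav", "text": "I got my teudat zehut"},
]

# Serialise: one JSON object per line (the JSONL convention).
jsonl_text = "\n".join(json.dumps(rec, ensure_ascii=False) for rec in samples)

# Deserialise: parse each non-empty line back into a dict.
records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
```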


Performance & WER Metrics

The training script, written by Claude Code, was based upon this excellent template provided by Modal.

I used an A100 for the training run which ran across 10 epochs and lasted approximately 30 minutes.

WER Improvement

| Metric | Value |
|---|---|
| Baseline WER (pre-training) | 16.79% |
| Post-training WER | 6.07% |
| Improvement | 63.8% reduction |
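Word error rate is the word-level edit distance (substitutions, insertions, deletions) divided by the number of reference words. Libraries such as `jiwer` compute it directly; as a self-contained sketch, the metric can be implemented in a few lines:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # One row of the Levenshtein dynamic-programming table at a time.
    row = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        diag = row[0]
        row[0] = i
        for j, h in enumerate(hyp, 1):
            cur = row[j]
            row[j] = min(row[j] + 1,            # deletion
                         row[j - 1] + 1,        # insertion
                         diag + (r != h))       # substitution (or match)
            diag = cur
    return row[len(hyp)] / len(ref)


# One wrong word out of five reference words -> WER of 0.2.
score = wer("I went to the makolet", "I went to the Macaulay")
```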

Baseline Performance: [figure]

Post-Training Performance: [figure]

Fine-tuning the Whisper Large V3 Turbo model on English-Hebrew code-switched data resulted in a 63.8% reduction in WER, demonstrating significant improvement in transcribing mixed-language speech.

Intended Uses & Limitations

This dataset and model were both intended as POCs - although I have been pleasantly surprised by the results of the demo. I chose Whisper Large V3 Turbo for the base model as it is OpenAI's own optimised variant of Large V3 and, in my own benchmarks, has provided the best performance.

Although the objective was to explore training on a small mixed-language dataset as a means of working around some of the deficiencies of rigid language-specific parameterisation, I also believe that a non-POC implementation of this approach could have significant benefits for similar linguistic groups. The approach of layering a fine-tune on top of a major/standard model could achieve measurable improvements in accuracy in commonly encountered use cases.

Training Procedure

The model was trained for 10 epochs on the English-Hebrew mixed sentences dataset, achieving a final WER of 6.07%.

Training Progress

Key Training Details:

  • Base Model: OpenAI Whisper Large V3 Turbo
  • Learning Rate: 1e-05
  • Batch Size: 32 (effective)
  • Epochs: 10
  • Mixed Precision: Native AMP
  • Framework: Transformers 4.44.0, PyTorch 2.4.0+cu121
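The details above map onto the standard `transformers` Seq2Seq training setup. This is a hedged sketch: the split of the effective batch size of 32 into a per-device size of 16 with 2 gradient accumulation steps is my assumption, not a documented detail of the training run.

```python
# Hyperparameters mirrored from the training details above.
TRAINING_ARGS = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 16,  # assumed split of the
    "gradient_accumulation_steps": 2,   # 32 effective batch size
    "num_train_epochs": 10,
    "fp16": True,                       # native AMP mixed precision
}

# These keys correspond to transformers.Seq2SeqTrainingArguments fields and
# could be passed as Seq2SeqTrainingArguments(output_dir="out", **TRAINING_ARGS).
effective_batch_size = (
    TRAINING_ARGS["per_device_train_batch_size"]
    * TRAINING_ARGS["gradient_accumulation_steps"]
)
```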

Author & License

Created by: Daniel Rosehill
Website: danielrosehill.com
License: MIT

Feedback

Feel free to share feedback and reports! This is an experimental model and community input is valuable for understanding its strengths and limitations in real-world code-switching scenarios.
