Whisper-Hindi2Hinglish-Swift is one of Oriserve's SOTA ASR models trained specifically for Indian languages and accents. The table below compares its Hinglish transcriptions with the output of the base Whisper model on sample audios.
| Audio | Whisper Base | Whisper-Hindi2Hinglish-Swift |
|---|---|---|
| *(audio sample)* | *(Urdu-script output; not preserved here)* | vah bas din mein kitni baar chalti hai? |
| *(audio sample)* | *(Urdu-script output; not preserved here)* | salmaan ki image se prabhaavit hote hain is company ke share bhaav jaane kaise? |
| *(audio sample)* | *(Urdu-script output; not preserved here)* | vah roya aur aur roya. |
| *(audio sample)* | *(Urdu-script output; not preserved here)* | helmet na pahnne se bhaarat mein har gante hoti hai chaar logon ki maut. |
| *(audio sample)* | *(Urdu-script output; not preserved here)* | usne mujhe chithi ka javaab na dene ke lie daanta. |
| *(audio sample)* | *(Urdu-script output; not preserved here)* | puraana shahar divaaron se ghera hua hai. |
Note: The scores below are Word Error Rates computed on the Hinglish text generated by each model; Whisper Base often transcribes Hindi audio into native script rather than Hinglish, which pushes its WER above 100%.

| Dataset | Whisper Base (WER %) | Whisper-Hindi2Hinglish-Swift (WER %) |
|---|---|---|
| Common-Voice | 106.7936 | 38.6549 |
| FLEURS | 104.2783 | 35.0888 |
| Indic-Voices | 110.8399 | 65.2147 |
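For reference, scores like these can be computed with the open-source `jiwer` library. The snippet below is only a sketch of such an evaluation (the example strings and setup are illustrative assumptions, not Oriserve's benchmarking code):

```python
# pip install jiwer
import jiwer

# Illustrative reference/hypothesis pairs; a real evaluation would iterate
# over the Common-Voice, FLEURS, and Indic-Voices test sets.
references = ["vah bas din mein kitni baar chalti hai"]
hypotheses = ["vah bas din me kitni bar chalti hai"]

wer = jiwer.wer(references, hypotheses)  # word error rate as a fraction
print(f"WER: {wer * 100:.2f}%")          # scaled to match the table above
```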
To use the model with Hugging Face `transformers`, first install the latest version of the library:

```bash
pip install --upgrade transformers
```

The model can then be used with the `pipeline` class to transcribe audios of arbitrary length:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
# Set device (GPU if available, otherwise CPU) and precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
# Specify the pre-trained model ID
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"
# Load the speech-to-text model with specified configurations
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,  # Use appropriate precision (float16 for GPU, float32 for CPU)
    low_cpu_mem_usage=True,   # Optimize memory usage during loading
    use_safetensors=True      # Use safetensors format for better security
)
model.to(device) # Move model to specified device
# Load the processor for audio preprocessing and tokenization
processor = AutoProcessor.from_pretrained(model_id)
# Create speech recognition pipeline
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
    generate_kwargs={
        "task": "transcribe",  # Set task to transcription
        "language": "en"       # Specify English (the model emits Hinglish in Latin script)
    }
)
# Process audio file and print transcription
sample = "sample.wav" # Input audio file path
result = pipe(sample) # Run inference
print(result["text"]) # Print transcribed text
The model can also be used with the original `openai-whisper` package. First, install the required dependencies:

```bash
pip install -U openai-whisper tqdm
```

Next, convert the model from the Hugging Face format to the OpenAI Whisper format:

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq
import re
from tqdm import tqdm
from collections import OrderedDict
import json
# Load parameter name mapping from HF to OpenAI format
with open('convert_hf2openai.json', 'r') as f:
reverse_translation = json.load(f)
reverse_translation = OrderedDict(reverse_translation)
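# Illustrative shape of the mapping (an assumption, not the actual file
# contents): regex patterns over Hugging Face parameter names map to OpenAI
# Whisper names, e.g. HF's "model.encoder.layers.0.self_attn.q_proj.weight"
# corresponds to OpenAI's "encoder.blocks.0.attn.query.weight".
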
def save_model(model, save_path):
    def reverse_translate(current_param):
        # Convert parameter names using regex patterns
        for pattern, repl in reverse_translation.items():
            if re.match(pattern, current_param):
                return re.sub(pattern, repl, current_param)
        return None  # No matching pattern: parameter will be skipped

    # Extract model dimensions from config
    config = model.config
    model_dims = {
        "n_mels": config.num_mel_bins,                   # Number of mel spectrogram bins
        "n_vocab": config.vocab_size,                    # Vocabulary size
        "n_audio_ctx": config.max_source_positions,      # Max audio context length
        "n_audio_state": config.d_model,                 # Audio encoder state dimension
        "n_audio_head": config.encoder_attention_heads,  # Audio encoder attention heads
        "n_audio_layer": config.encoder_layers,          # Number of audio encoder layers
        "n_text_ctx": config.max_target_positions,       # Max text context length
        "n_text_state": config.d_model,                  # Text decoder state dimension
        "n_text_head": config.decoder_attention_heads,   # Text decoder attention heads
        "n_text_layer": config.decoder_layers,           # Number of text decoder layers
    }

    # Convert model state dict to Whisper format
    original_model_state_dict = model.state_dict()
    new_state_dict = {}
    for key, value in tqdm(original_model_state_dict.items()):
        key = key.replace("model.", "")   # Remove 'model.' prefix
        new_key = reverse_translate(key)  # Convert parameter names
        if new_key is not None:
            new_state_dict[new_key] = value

    # Create final model dictionary
    pytorch_model = {"dims": model_dims, "model_state_dict": new_state_dict}

    # Save converted model
    torch.save(pytorch_model, save_path)
# Load Hugging Face model
model_id = "Oriserve/Whisper-Hindi2Hinglish-Swift"
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    low_cpu_mem_usage=True,  # Optimize memory usage
    use_safetensors=True     # Use safetensors format
)
# Convert and save model
model_save_path = "Whisper-Hindi2Hinglish-Swift.pt"
save_model(model, model_save_path)
```
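Before loading the converted checkpoint, it can be worth a quick sanity check that the file has the layout `whisper` expects (an optional step, not part of the conversion script above):

```python
import torch

# The converted file should contain the model dimensions plus the renamed
# state dict produced by save_model above.
ckpt = torch.load("Whisper-Hindi2Hinglish-Swift.pt", map_location="cpu")
print(ckpt["dims"])  # n_mels, n_vocab, n_audio_ctx, ...
print(len(ckpt["model_state_dict"]), "tensors converted")
```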
The converted checkpoint can then be loaded and run directly with `whisper`:

```python
import whisper
# Load converted model with Whisper and transcribe
model = whisper.load_model("Whisper-Hindi2Hinglish-Swift.pt")
result = model.transcribe("sample.wav")
print(result["text"])
This model is from a family of transformer-based ASR models trained by Oriserve. To compare this model against other models from the same family, or against other SOTA models, head over to our Speech-To-Text Arena. To learn more about our other models, or for any other queries regarding AI voice agents, reach out to us at [email protected].
Base model: openai/whisper-base