---
tags:
- transformers
- llama
- trl
- orpeheutts
- tts
- Texttospeech
license: apache-2.0
language:
- es
datasets:
- sirekist98/spanish_tts_noauddataset_24khz
base_model:
- canopylabs/3b-es_it-pretrain-research_release
pipeline_tag: text-to-speech
---

# Spanish TTS Model with Emotions and Multiple Voices

This repository contains a fine-tuned Spanish Text-to-Speech (TTS) model based on [`canopylabs/3b-es_it-pretrain-research_release`](https://huggingface.co/canopylabs/3b-es_it-pretrain-research_release). The model supports multiple voices and nuanced emotions. It was trained with [Unsloth](https://github.com/unslothai/unsloth) and uses [SNAC](https://huggingface.co/hubertsiuzdak/snac_24khz) for audio tokenization.

➡️ **Try it online**: [https://huggingface.co/spaces/sirekist98/orpheustts\_spanish\_tuned](https://huggingface.co/spaces/sirekist98/orpheustts_spanish_tuned)

---

## 👨‍💻 Model Summary

* **Base model**: `canopylabs/3b-es_it-pretrain-research_release`
* **Fine-tuned with**: LoRA adapters (rank 64, alpha 64)
* **Audio tokenization**: SNAC (24 kHz)
* **Input format**: `source (emotion): text`
* **Dataset**: \~109k samples, 11 emotions × 11 speakers
* **Training framework**: Unsloth + Hugging Face Transformers

---

## 🚀 Training Overview

The model was trained on a curated subset of the dataset [`sirekist98/spanish_tts_noauddataset_24khz`](https://huggingface.co/datasets/sirekist98/spanish_tts_noauddataset_24khz). We kept only the combinations of speaker (`source`) and `emotion` that have at least 1000 samples, which gives a balanced dataset of over 109,000 examples.

Each sample was tokenized using SNAC and embedded in a prompt structured as:

```text
source (emotion): text
```

The model learns to generate the corresponding audio tokens from this prompt, which lets it pick up nuanced emotional prosody and voice control. We trained the model for 1 epoch with a batch size of 8 and 4 gradient accumulation steps, using 4-bit quantization on an NVIDIA L4 GPU.

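
The selection rule and prompt format described above can be sketched as follows. This is a minimal illustration, not the actual preprocessing script: the record keys `source`, `emotion`, and `text`, the helper names `select_balanced` and `build_prompt`, and the toy records are assumptions made for the example.

```python
from collections import Counter

MIN_SAMPLES = 1000  # threshold stated in the training overview


def select_balanced(records, min_samples=MIN_SAMPLES):
    """Keep records whose (source, emotion) pair occurs at least min_samples times."""
    counts = Counter((r["source"], r["emotion"]) for r in records)
    return [r for r in records if counts[(r["source"], r["emotion"])] >= min_samples]


def build_prompt(source, emotion, text):
    """Format one sample as `source (emotion): text`."""
    return f"{source} ({emotion}): {text}"


# Toy records (hypothetical); the threshold is lowered so the demo stays small.
records = (
    [{"source": "alloy",
      "emotion": "intense_pride_dignity_self_confidence_and_honor",
      "text": "Hola."}] * 3
    + [{"source": "ash",
        "emotion": "intense_sympathy_compassion_warmth_trust_and_tenderness",
        "text": "Adiós."}] * 1
)

kept = select_balanced(records, min_samples=2)
prompts = [build_prompt(r["source"], r["emotion"], r["text"]) for r in kept]

# Effective batch size implied by the stated settings:
# 8 samples per step x 4 gradient accumulation steps.
effective_batch_size = 8 * 4
```

With the toy data, only the `alloy` pair survives the filter because the `ash` pair falls below the lowered threshold. The same counting approach would apply to the real dataset with the default threshold of 1000.
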

---

## 🔊 Inference

You can run inference using the demo space: [Orpheus TTS Spanish Fine-Tuned](https://huggingface.co/spaces/sirekist98/orpheustts_spanish_tuned).

To run inference locally with full control:

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
from snac import SNAC

# --- Minimal config ---
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
BASE = "canopylabs/3b-es_it-pretrain-research_release"
LORA = "sirekist98/spanish_tts_emotions"
SNAC_ID = "hubertsiuzdak/snac_24khz"

VOICE = "alloy"
EMOTION_ID = "intense_fear_dread_apprehension_horror_terror_panic"
TEXT = "Estoy atrapado, por favor ayúdame."
prompt = f"{VOICE} ({EMOTION_ID}): {TEXT}"

# --- Load models ---
tokenizer = AutoTokenizer.from_pretrained(BASE)
base_model = AutoModelForCausalLM.from_pretrained(
    BASE,
    torch_dtype=torch.float16 if device.type == "cuda" else torch.float32
)
model = PeftModel.from_pretrained(base_model, LORA).to(device).eval()
snac_model = SNAC.from_pretrained(SNAC_ID).to(device)

# --- Prepare input (same as the demo Space) ---
# Wrap the tokenized prompt in the special start/end token IDs.
input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)
start_tok = torch.tensor([[128259]], dtype=torch.long).to(device)
end_toks = torch.tensor([[128009, 128260]], dtype=torch.long).to(device)
input_ids = torch.cat([start_tok, input_ids, end_toks], dim=1)

# Left-pad to a fixed length and mask out the padding positions.
MAX_LEN = 4260
pad_len = MAX_LEN - input_ids.shape[1]
pad = torch.full((1, pad_len), 128263, dtype=torch.long).to(device)
input_ids = torch.cat([pad, input_ids], dim=1)
attention_mask = torch.cat(
    [
        torch.zeros((1, pad_len), dtype=torch.long),
        torch.ones((1, input_ids.shape[1] - pad_len), dtype=torch.long),
    ],
    dim=1,
).to(device)

# --- Generate ---
generated = model.generate(
    input_ids=input_ids,
    attention_mask=attention_mask,
    max_new_tokens=1200,
    do_sample=True,
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.1,
    num_return_sequences=1,
    eos_token_id=128258,
    use_cache=True
)

# --- Post-process (find 128257, remove 128258, multiple of 7, subtract 128266) ---
AUDIO_TOKEN_OFFSET = 128266
token_to_find = 128257
token_to_remove = 128258

# Keep only what follows the last 128257 marker, if one was generated.
idxs = (generated == token_to_find).nonzero(as_tuple=True)
cropped = generated[:, idxs[1][-1].item() + 1:] if len(idxs[1]) > 0 else generated
cleaned = cropped[cropped != token_to_remove]
# Trim to whole 7-token frames and shift token IDs down to code values.
codes = cleaned[: (len(cleaned) // 7) * 7].tolist()
codes = [int(t) - AUDIO_TOKEN_OFFSET for t in codes]

# --- SNAC decode (same layout as the demo Space) ---
# Each 7-token frame carries 1 + 2 + 4 codes for the three SNAC layers.
layer_1, layer_2, layer_3 = [], [], []
for i in range((len(codes) + 1) // 7):
    b = 7 * i
    if b + 6 >= len(codes):
        break
    layer_1.append(codes[b + 0])
    layer_2.append(codes[b + 1] - 4096)
    layer_3.append(codes[b + 2] - 2 * 4096)
    layer_3.append(codes[b + 3] - 3 * 4096)
    layer_2.append(codes[b + 4] - 4 * 4096)
    layer_3.append(codes[b + 5] - 5 * 4096)
    layer_3.append(codes[b + 6] - 6 * 4096)

dev_snac = snac_model.quantizer.quantizers[0].codebook.weight.device
layers = [
    torch.tensor(layer_1).unsqueeze(0).to(dev_snac),
    torch.tensor(layer_2).unsqueeze(0).to(dev_snac),
    torch.tensor(layer_3).unsqueeze(0).to(dev_snac),
]

with torch.no_grad():
    audio = snac_model.decode(layers).squeeze().cpu().numpy()

# 'audio' is the 24 kHz waveform.
# Optional:
# from scipy.io.wavfile import write as write_wav
# write_wav("output.wav", 24000, audio)
```

---

## 🗣️ Available Voices

You can generate speech using the following voices (`source`):

```
alloy, ash, ballad, coral, echo, fable, nova, onyx, sage, shimmer, verse
```

## 🌧️ Available Emotions for each voice

---

## alloy

* intense\_interest\_fascination\_curiosity\_and\_intrigue
* intense\_fear\_dread\_apprehension\_and\_horror
* intense\_ecstasy\_pleasure\_bliss\_rapture\_and\_beatitude
* intense\_numbness\_detachment\_insensitivity\_and\_apathy
* intense\_contempt\_disdain\_loathing\_and\_detestation
* intense\_astonishment\_surprise\_amazement\_and\_shock
* intense\_confusion\_bewilderment\_disorientation\_and\_perplexity
* intense\_pride\_dignity\_self\_confidence\_and\_honor
* intense\_sourness\_tartness\_and\_acidity
* intense\_sympathy\_compassion\_warmth\_trust\_and\_tenderness

## ash

* intense\_interest\_fascination\_curiosity\_and\_intrigue
* intense\_fear\_dread\_apprehension\_and\_horror
* intense\_ecstasy\_pleasure\_bliss\_rapture\_and\_beatitude
* intense\_numbness\_detachment\_insensitivity\_and\_apathy
* intense\_astonishment\_surprise\_amazement\_and\_shock
* intense\_sympathy\_compassion\_warmth\_trust\_and\_tenderness

## ballad

* intense\_interest\_fascination\_curiosity\_and\_intrigue
* intense\_fear\_dread\_apprehension\_and\_horror
* intense\_ecstasy\_pleasure\_bliss\_rapture\_and\_beatitude
* intense\_numbness\_detachment\_insensitivity\_and\_apathy
* intense\_contempt\_disdain\_loathing\_and\_detestation
* intense\_astonishment\_surprise\_amazement\_and\_shock
* intense\_confusion\_bewilderment\_disorientation\_and\_perplexity
* intense\_helplessness\_powerlessness\_desperation\_and\_submission
* intense\_pride\_dignity\_self\_confidence\_and\_honor
* intense\_sourness\_tartness\_and\_acidity

## coral

* intense\_fear\_dread\_apprehension\_and\_horror
* intense\_ecstasy\_pleasure\_bliss\_rapture\_and\_beatitude
* intense\_numbness\_detachment\_insensitivity\_and\_apathy
* intense\_contempt\_disdain\_loathing\_and\_detestation
* intense\_confusion\_bewilderment\_disorientation\_and\_perplexity
* intense\_helplessness\_powerlessness\_desperation\_and\_submission
* intense\_pride\_dignity\_self\_confidence\_and\_honor
* intense\_sourness\_tartness\_and\_acidity
* intense\_sympathy\_compassion\_warmth\_trust\_and\_tenderness

## echo

* intense\_interest\_fascination\_curiosity\_and\_intrigue
* intense\_ecstasy\_pleasure\_bliss\_rapture\_and\_beatitude
* intense\_numbness\_detachment\_insensitivity\_and\_apathy
* intense\_contempt\_disdain\_loathing\_and\_detestation
* intense\_astonishment\_surprise\_amazement\_and\_shock
* intense\_helplessness\_powerlessness\_desperation\_and\_submission
* intense\_pride\_dignity\_self\_confidence\_and\_honor
* intense\_sympathy\_compassion\_warmth\_trust\_and\_tenderness

## fable

* intense\_interest\_fascination\_curiosity\_and\_intrigue
* intense\_fear\_dread\_apprehension\_and\_horror
* intense\_ecstasy\_pleasure\_bliss\_rapture\_and\_beatitude
* intense\_numbness\_detachment\_insensitivity\_and\_apathy
* intense\_contempt\_disdain\_loathing\_and\_detestation
* intense\_helplessness\_powerlessness\_desperation\_and\_submission
* intense\_sourness\_tartness\_and\_acidity

## nova

* intense\_ecstasy\_pleasure\_bliss\_rapture\_and\_beatitude
* intense\_contempt\_disdain\_loathing\_and\_detestation
* intense\_astonishment\_surprise\_amazement\_and\_shock
* intense\_confusion\_bewilderment\_disorientation\_and\_perplexity
* intense\_helplessness\_powerlessness\_desperation\_and\_submission
* intense\_pride\_dignity\_self\_confidence\_and\_honor
* intense\_sourness\_tartness\_and\_acidity
* intense\_sympathy\_compassion\_warmth\_trust\_and\_tenderness

## onyx

* intense\_interest\_fascination\_curiosity\_and\_intrigue
* intense\_fear\_dread\_apprehension\_and\_horror
* intense\_numbness\_detachment\_insensitivity\_and\_apathy
* intense\_confusion\_bewilderment\_disorientation\_and\_perplexity
* intense\_helplessness\_powerlessness\_desperation\_and\_submission
* intense\_pride\_dignity\_self\_confidence\_and\_honor
* intense\_sympathy\_compassion\_warmth\_trust\_and\_tenderness

## sage

* intense\_interest\_fascination\_curiosity\_and\_intrigue
* intense\_fear\_dread\_apprehension\_and\_horror
* intense\_ecstasy\_pleasure\_bliss\_rapture\_and\_beatitude
* intense\_numbness\_detachment\_insensitivity\_and\_apathy
* intense\_astonishment\_surprise\_amazement\_and\_shock
* intense\_confusion\_bewilderment\_disorientation\_and\_perplexity
* intense\_pride\_dignity\_self\_confidence\_and\_honor
* intense\_sourness\_tartness\_and\_acidity
* intense\_sympathy\_compassion\_warmth\_trust\_and\_tenderness

## shimmer

* intense\_interest\_fascination\_curiosity\_and\_intrigue
* intense\_fear\_dread\_apprehension\_and\_horror
* intense\_ecstasy\_pleasure\_bliss\_rapture\_and\_beatitude
* intense\_numbness\_detachment\_insensitivity\_and\_apathy
* intense\_contempt\_disdain\_loathing\_and\_detestation
* intense\_astonishment\_surprise\_amazement\_and\_shock
* intense\_confusion\_bewilderment\_disorientation\_and\_perplexity
* intense\_helplessness\_powerlessness\_desperation\_and\_submission
* intense\_pride\_dignity\_self\_confidence\_and\_honor
* intense\_sourness\_tartness\_and\_acidity

## verse

* intense\_interest\_fascination\_curiosity\_and\_intrigue
* intense\_fear\_dread\_apprehension\_and\_horror
* intense\_ecstasy\_pleasure\_bliss\_rapture\_and\_beatitude
* intense\_numbness\_detachment\_insensitivity\_and\_apathy
* intense\_contempt\_disdain\_loathing\_and\_detestation
* intense\_astonishment\_surprise\_amazement\_and\_shock
* intense\_helplessness\_powerlessness\_desperation\_and\_submission
* intense\_sourness\_tartness\_and\_acidity

---

## 📖 Citation

```bibtex
@misc{sirekist2025spanishTTS,
  author = {sirekist98},
  title = {Spanish TTS Model with Emotions and Multiple Voices},
  year = {2025},
  howpublished = {\url{https://huggingface.co/sirekist98/spanish_model}}
}
```

---

## ✨ Acknowledgements

* [Unsloth](https://github.com/unslothai/unsloth)
* [SNAC](https://huggingface.co/hubertsiuzdak/snac_24khz)
* [Hugging Face Datasets and Spaces](https://huggingface.co/)

---

## ❓ Questions or Contributions?

Open an issue or contact [@sirekist98](https://huggingface.co/sirekist98) on Hugging Face.

Thanks for checking out this model! 🚀