facebook/hf-seamless-m4t-medium

+---
+inference: false
+tags:
+- SeamlessM4T
+- seamless_m4t
+license: cc-by-nc-4.0
+library_name: transformers
+---
+# SeamlessM4T Medium
+SeamlessM4T is a collection of models designed to provide high-quality translation, allowing people from different
+linguistic communities to communicate effortlessly through speech and text.
+SeamlessM4T covers:
+- 📥 101 languages for speech input
+- ⌨️ 96 languages for text input/output
+- 🗣️ 35 languages for speech output
+This is the "medium" variant of the unified model, which enables multiple tasks without relying on multiple separate models:
+- Speech-to-speech translation (S2ST)
+- Speech-to-text translation (S2TT)
+- Text-to-speech translation (T2ST)
+- Text-to-text translation (T2TT)
+- Automatic speech recognition (ASR)
+You can perform all of the above tasks with a single model, `SeamlessM4TModel`, but each task also has its own dedicated sub-model.
+## Usage
+First, load the processor and a checkpoint of the model:
+```python
+>>> from transformers import AutoProcessor, SeamlessM4TModel
+>>> processor = AutoProcessor.from_pretrained("ylacombe/hf-seamless-m4t-medium")
+>>> model = SeamlessM4TModel.from_pretrained("ylacombe/hf-seamless-m4t-medium")
+```
+You can use this model on text or on audio to generate either translated text or translated audio.
+### Speech
+You can generate translated speech with [`SeamlessM4TModel.generate`]. Here is an example that translates English text into Russian speech:
+```python
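+>>> # Tokenize the English source text
+>>> inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
+>>> # generate() returns a tuple whose first element is the waveform tensor
+>>> audio_array = model.generate(**inputs, tgt_lang="rus")
+>>> audio_array = audio_array[0].cpu().numpy().squeeze()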
+```
+You can also translate directly from a speech waveform. Here is an example from Arabic to Russian:
+```python
+>>> from datasets import load_dataset
+>>> # Load a single Arabic speech sample and extract its raw waveform
+>>> dataset = load_dataset("arabic_speech_corpus", split="test[0:1]")
+>>> audio_sample = dataset["audio"][0]["array"]
+>>> inputs = processor(audios=audio_sample, return_tensors="pt")
+>>> audio_array = model.generate(**inputs, tgt_lang="rus")
+>>> audio_array = audio_array[0].cpu().numpy().squeeze()
+```
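Both snippets above leave `audio_array` as a one-dimensional waveform. To listen to it, it can be written to a WAV file. The following is a minimal sketch using only the Python standard library. It assumes float samples roughly in [-1.0, 1.0] and a 16 kHz sampling rate (in practice, read the rate from `model.config.sampling_rate`). The helper `save_waveform` and the output file name are hypothetical, and a synthetic tone stands in for `audio_array` so that the sketch runs without the model.

```python
import math
import os
import struct
import tempfile
import wave


def save_waveform(path, samples, sample_rate=16000):
    """Write mono float samples to a 16-bit PCM WAV file (hypothetical helper)."""
    frames = bytearray()
    for value in samples:
        # Clip to [-1.0, 1.0] so the scaled value fits in a signed 16-bit integer.
        clipped = max(-1.0, min(1.0, float(value)))
        frames += struct.pack("<h", int(clipped * 32767))
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)            # mono
        wav_file.setsampwidth(2)            # 2 bytes per sample = 16-bit PCM
        wav_file.setframerate(sample_rate)  # assumed 16 kHz unless overridden
        wav_file.writeframes(bytes(frames))


# Stand-in for `audio_array`: one second of a 440 Hz tone at 16 kHz.
demo_waveform = [0.5 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
output_path = os.path.join(tempfile.gettempdir(), "translated_speech.wav")
save_waveform(output_path, demo_waveform)
```

With the real model output, the call would be `save_waveform(output_path, audio_array, sample_rate=model.config.sampling_rate)`.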
+#### Tips
+[`SeamlessM4TModel`] is the top-level Transformers model for generating speech and text, but you can also use dedicated models that perform a single task without the additional components, which reduces the memory footprint.
+For example, you can swap the model in the previous snippet for the one dedicated to the S2ST task:
+```python
+>>> from transformers import SeamlessM4TForSpeechToSpeech
+>>> model = SeamlessM4TForSpeechToSpeech.from_pretrained("ylacombe/hf-seamless-m4t-medium")
+```
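The dedicated classes can be collected in a small lookup table. This is a sketch, not an official mapping: `DEDICATED_MODEL_CLASSES` and `dedicated_class_name` are hypothetical names, `SeamlessM4TForTextToSpeech` is not mentioned elsewhere in this card, and the ASR entry assumes that recognition is served by the speech-to-text class with the target language set to the spoken language.

```python
# Hypothetical lookup table: task abbreviation -> name of the dedicated Transformers class.
DEDICATED_MODEL_CLASSES = {
    "S2ST": "SeamlessM4TForSpeechToSpeech",
    "S2TT": "SeamlessM4TForSpeechToText",
    "T2ST": "SeamlessM4TForTextToSpeech",
    "T2TT": "SeamlessM4TForTextToText",
    "ASR": "SeamlessM4TForSpeechToText",  # assumption: ASR reuses the speech-to-text class
}


def dedicated_class_name(task):
    """Return the dedicated class name for a task abbreviation such as "s2st"."""
    key = task.upper()
    if key not in DEDICATED_MODEL_CLASSES:
        raise ValueError(
            f"Unknown task {task!r}; expected one of {sorted(DEDICATED_MODEL_CLASSES)}"
        )
    return DEDICATED_MODEL_CLASSES[key]


print(dedicated_class_name("s2st"))  # SeamlessM4TForSpeechToSpeech
```

With Transformers installed, `getattr(transformers, dedicated_class_name("S2TT"))` resolves the name to the class, which can then be loaded with `from_pretrained` as in the snippets above.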
+### Text
+Similarly, you can generate translated text from audio or from text, this time using the dedicated models. From audio:
+```python
+>>> from transformers import SeamlessM4TForSpeechToText
+>>> model = SeamlessM4TForSpeechToText.from_pretrained("ylacombe/hf-seamless-m4t-medium")
+>>> audio_sample = dataset["audio"][0]["array"]
+>>> inputs = processor(audios=audio_sample, return_tensors="pt")
+>>> output_tokens = model.generate(**inputs, tgt_lang="fra")
+>>> translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
+```
+And from text:
+```python
+>>> from transformers import SeamlessM4TForTextToText
+>>> model = SeamlessM4TForTextToText.from_pretrained("ylacombe/hf-seamless-m4t-medium")
+>>> inputs = processor(text="Hello, my dog is cute", src_lang="eng", return_tensors="pt")
+>>> output_tokens = model.generate(**inputs, tgt_lang="fra")
+>>> translated_text = processor.decode(output_tokens.tolist()[0], skip_special_tokens=True)
+```
+#### Tips
+Three final tips:
+1. [`SeamlessM4TModel`] can generate text and/or speech. Pass `generate_speech=False` to [`SeamlessM4TModel.generate`] to generate text only. You can also pass `return_intermediate_token_ids=True` to get both the text token ids and the generated speech.
+2. You can change the speaker used for speech synthesis with the `spkr_id` argument.
+3. You can use different [generation strategies](./generation_strategies) for speech and text generation, e.g. `.generate(input_ids=input_ids, text_num_beams=4, speech_do_sample=True)`, which successively performs beam-search decoding on the text model and multinomial sampling on the speech model.