Qwen3-Omni-30B-A3B-Instruct-4bit-MLX

This is a 4-bit quantized version of Qwen3-Omni-30B-A3B-Instruct converted for Apple MLX.

This implementation supports text-only inference. Audio understanding is possible through a hybrid approach (see below).

I am opening a PR HERE for mlx-lm to support qwen3-omni-moe. If this model doesn't work for you yet, check the status of that PR and wait until it has been merged into mlx-lm AND mlx-omni before updating your mlx-lm or mlx-omni libraries, OR patch it in yourself (all it takes is copying the file locally into the right spot).
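
A minimal sketch of that local patch (assumptions: the PR's model file is named qwen3_omni_moe.py and sits in your working directory; copy whichever file the PR actually adds under mlx_lm/models/):

import os
import shutil

import mlx_lm

# Copy the model definition from the PR branch into your local mlx-lm install
models_dir = os.path.join(os.path.dirname(mlx_lm.__file__), "models")
shutil.copy("qwen3_omni_moe.py", models_dir)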

Quick Start

MLX Library

from mlx_lm import load, generate

model, tokenizer = load("pherber3/Qwen3-Omni-30B-A3B-Instruct-4bit-mlx")

messages = [{"role": "user", "content": "What is machine learning?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)

MLX Omni Server OpenAI API

This assumes an MLX Omni Server instance is already running locally at the address used below (http://localhost:10240/v1).

from openai import OpenAI
client = OpenAI(
    base_url="http://localhost:10240/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="pherber3/Qwen3-Omni-30B-A3B-Instruct-4bit-mlx",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)

Command Line

mlx_lm.generate --model pherber3/Qwen3-Omni-30B-A3B-Instruct-4bit-mlx \
  --prompt "Explain quantum computing"

Audio Understanding (Hybrid PyTorch + MLX)

While this MLX model only supports text inference, you can achieve audio understanding by combining it with the PyTorch audio encoder. The method below is admittedly hacky: we rip the audio tower out of the base model and splice its embeddings into the MLX language model's input:

import torch
import mlx.core as mx
from mlx_lm import load
from mlx_lm.generate import generate_step
from mlx_lm.sample_utils import make_sampler
from transformers import Qwen3OmniMoeProcessor, AutoConfig
from qwen_omni_utils import process_mm_info
import numpy as np

from transformers.models.qwen3_omni_moe.modeling_qwen3_omni_moe import Qwen3OmniMoeAudioEncoder
from huggingface_hub import snapshot_download
from openai import OpenAI
import glob
import os
from safetensors import safe_open

MLX_MODEL_PATH = "pherber3/Qwen3-Omni-30B-A3B-Instruct-4bit-mlx"
HF_MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

# Make a dummy audio file for testing
input_str = """
The sky above the port was the color of television, tuned to a dead channel. 
"It's not like I'm using," Case heard someone say, as he shouldered his way 
through the crowd around the door of the Chat. "It's like my body's developed 
this massive drug deficiency." It was a Sprawl voice and a Sprawl joke.
"""

client = OpenAI(
    base_url="http://localhost:10240/v1",
    api_key="not-needed"
)
response = client.audio.speech.create(
    model="mlx-community/Kokoro-82M-4bit",
    voice="af_sky",
    input=input_str,
)
response.stream_to_file("neuro_output.wav")

### Process audio → embeddings → MLX generation (so begins the jank) ###

print("Loading processor...")
processor = Qwen3OmniMoeProcessor.from_pretrained(HF_MODEL_PATH)

print("Loading audio_tower...")

config = AutoConfig.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
audio_config = config.thinker_config.audio_config
audio_tower = Qwen3OmniMoeAudioEncoder(audio_config)
audio_tower.eval()

# Load audio_tower weights
model_path = snapshot_download(HF_MODEL_PATH, allow_patterns=["*.safetensors", "*.json"])
safetensor_files = sorted(glob.glob(os.path.join(model_path, "*.safetensors")))
audio_tower_weights = {}
for st_file in safetensor_files:
    with safe_open(st_file, framework="pt") as f:
        for key in f.keys():
            if key.startswith("thinker.audio_tower."):
                new_key = key.replace("thinker.audio_tower.", "")
                audio_tower_weights[new_key] = f.get_tensor(key)
audio_tower.load_state_dict(audio_tower_weights, strict=False)

print("Loading MLX language model...")
model, tokenizer = load(MLX_MODEL_PATH)

# Function to process audio and generate response
def understand_audio(audio_path, question):
    """Process audio file and answer questions about it"""
    
    # Prepare conversation
    conversation = [{"role": "user", "content": [
        {"type": "audio", "audio": audio_path},
        {"type": "text", "text": question}
    ]}]
    
    # Process inputs
    text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
    inputs = processor(text=text_prompt, audio=audios, images=images, videos=videos, 
                       return_tensors="pt", padding=True, use_audio_in_video=False)
    
    # Process audio through audio_tower
    with torch.no_grad():
        audio_features = inputs['input_features'].squeeze(0)
        feature_lens = inputs['feature_attention_mask'].sum(dim=1)
        audio_outputs = audio_tower(audio_features, feature_lens=feature_lens)
        audio_embeddings = audio_outputs.last_hidden_state
    
    # Merge text and audio embeddings
    audio_token_id = config.thinker_config.audio_token_id
    input_ids_np = inputs["input_ids"][0].numpy()
    audio_positions = np.where(input_ids_np == audio_token_id)[0]
    
    embed_layer = model.language_model.model.embed_tokens
    all_embeddings = embed_layer(mx.array(input_ids_np))
    audio_embeddings_mlx = mx.array(audio_embeddings.cpu().numpy())
    
    # Replace audio tokens with audio embeddings
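    # (each audio placeholder token is swapped 1:1 for one audio-frame embedding,
    #  so the merged sequence keeps its original length)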
    segments = []
    last_pos = 0
    for i, pos in enumerate(audio_positions):
        pos = int(pos)
        if pos > last_pos:
            segments.append(all_embeddings[last_pos:pos])
        segments.append(audio_embeddings_mlx[i:i+1])
        last_pos = pos + 1
    if last_pos < all_embeddings.shape[0]:
        segments.append(all_embeddings[last_pos:])
    merged_embeddings = mx.concatenate(segments, axis=0)
    
    # Generate response
    sampler = make_sampler(temp=0.7, top_p=0.9)
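    # Placeholder prompt ids sized to the merged embeddings; generation is driven
    # by the input_embeddings passed to generate_step below, not by these ids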
    dummy_prompt = mx.zeros((merged_embeddings.shape[0],), dtype=mx.int32)
    
    eos_token_id = tokenizer.eos_token_id
    im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
    stop_tokens = {eos_token_id, im_end_id}
    
    tokens = []
    for (token, _), n in zip(generate_step(prompt=dummy_prompt, model=model, 
                                           max_tokens=100, sampler=sampler,
                                           input_embeddings=merged_embeddings),
                             range(100)):
        token_id = token if isinstance(token, int) else token.item()
        tokens.append(token_id)
        if token_id in stop_tokens:
            break
    
    response = tokenizer.decode(tokens).replace("<|im_end|>", "").replace("<|endoftext|>", "").strip()
    return response

# The models stay loaded, so you can keep calling understand_audio() with new audio files and prompts
response = understand_audio("neuro_output.wav", "Summarize this audio clip.")
print(response)

Model Details

  • Base Model: Qwen3-Omni-30B-A3B-Instruct
  • Quantization: 4-bit (group_size=64, bits=4)
  • Framework: MLX / Apple Silicon
  • Components: Text-only (thinker component)
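
For reference, a quantization with these settings would typically be produced with mlx-lm's convert utility; a minimal sketch, assuming an mlx-lm build that already understands this architecture (see the PR note above):

from mlx_lm import convert

# Quantize the text (thinker) weights to 4-bit with group size 64
convert(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    mlx_path="Qwen3-Omni-30B-A3B-Instruct-4bit-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)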

Implementation Notes

This model is a wrapper around the qwen3_moe architecture that:

  • Loads only the text language model weights
  • Filters out multimodal components (audio_tower, talker, code2wav, visual); a sketch of this filtering follows the list
  • Supports input_embeddings parameter for hybrid multimodal use cases
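
A minimal sketch of the kind of filter described above (illustrative only; keep_weight is a hypothetical helper, not the actual conversion code):

# Prefixes of the multimodal components dropped from the checkpoint
SKIP_PREFIXES = ("thinker.audio_tower.", "thinker.visual.", "talker.", "code2wav.")

def keep_weight(name: str) -> bool:
    # Keep only the thinker's text language-model weights
    return not name.startswith(SKIP_PREFIXES)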

License

Apache 2.0 (same as base model)

Acknowledgments

  • Qwen Team for the original model
  • MLX team for the framework