# Qwen3-Omni-30B-A3B-Instruct-4bit-MLX
This is a 4-bit quantized version of Qwen3-Omni-30B-A3B-Instruct converted for Apple MLX.
This implementation supports text-only inference. Audio understanding is possible through a hybrid approach (see below).
I am opening a PR HERE for mlx-lm to support qwen3-omni-moe. If this model doesn't work for you yet, check the status of that PR and wait until it is merged into mlx-lm AND mlx-omni before updating those libraries, or patch it in yourself (all it takes is copying the model file locally into the right spot; see the sketch below).
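If you go the local-patch route, one way to find where the model file belongs is to print mlx-lm's models directory. The file name below is a guess for illustration; use whatever file the PR actually adds.

```python
# Print the mlx-lm models directory; a local copy of the qwen3-omni-moe
# model file (name is a guess, e.g. qwen3_omni_moe.py) would go here.
import os
import mlx_lm

print(os.path.join(os.path.dirname(mlx_lm.__file__), "models"))
```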
## Quick Start

### MLX Library

```python
from mlx_lm import load, generate

model, tokenizer = load("pherber3/Qwen3-Omni-30B-A3B-Instruct-4bit-mlx")

messages = [{"role": "user", "content": "What is machine learning?"}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

response = generate(model, tokenizer, prompt=prompt, max_tokens=100)
print(response)
```
### MLX Omni Server (OpenAI API)

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:10240/v1",
    api_key="not-needed"
)

response = client.chat.completions.create(
    model="pherber3/Qwen3-Omni-30B-A3B-Instruct-4bit-mlx",
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.choices[0].message.content)
```
### Command Line

```bash
mlx_lm.generate --model pherber3/Qwen3-Omni-30B-A3B-Instruct-4bit-mlx \
  --prompt "Explain quantum computing"
```
## Audio Understanding (Hybrid PyTorch + MLX)
While this MLX model only supports text inference, you can get audio understanding by pairing it with the PyTorch audio encoder. It's a fairly hacky method: we rip the audio tower out of the base model, run the audio through it in PyTorch, and feed the resulting embeddings into the MLX language model:
```python
import torch
import mlx.core as mx
from mlx_lm import load
from mlx_lm.generate import generate_step
from mlx_lm.sample_utils import make_sampler
from transformers import Qwen3OmniMoeProcessor, AutoConfig
from qwen_omni_utils import process_mm_info
import numpy as np
from transformers.models.qwen3_omni_moe.modeling_qwen3_omni_moe import Qwen3OmniMoeAudioEncoder
from huggingface_hub import snapshot_download
from openai import OpenAI
import glob
import os
from safetensors import safe_open

MLX_MODEL_PATH = "pherber3/Qwen3-Omni-30B-A3B-Instruct-4bit-mlx"
HF_MODEL_PATH = "Qwen/Qwen3-Omni-30B-A3B-Instruct"

# Make a dummy audio file for testing
input_str = """
The sky above the port was the color of television, tuned to a dead channel.
"It's not like I'm using," Case heard someone say, as he shouldered his way
through the crowd around the door of the Chat. "It's like my body's developed
this massive drug deficiency." It was a Sprawl voice and a Sprawl joke.
"""

client = OpenAI(
    base_url="http://localhost:10240/v1",
    api_key="not-needed"
)
response = client.audio.speech.create(
    model="mlx-community/Kokoro-82M-4bit",
    voice="af_sky",
    input=input_str,
)
response.stream_to_file("neuro_output.wav")
### Process audio → embeddings → MLX generation (so begins the jank) ###
print("Loading processor...")
processor = Qwen3OmniMoeProcessor.from_pretrained(HF_MODEL_PATH)
print("Loading audio_tower...")
config = AutoConfig.from_pretrained(HF_MODEL_PATH, trust_remote_code=True)
audio_config = config.thinker_config.audio_config
audio_tower = Qwen3OmniMoeAudioEncoder(audio_config)
audio_tower.eval()
# Load audio_tower weights
model_path = snapshot_download(HF_MODEL_PATH, allow_patterns=["*.safetensors", "*.json"])
safetensor_files = sorted(glob.glob(os.path.join(model_path, "*.safetensors")))
audio_tower_weights = {}
for st_file in safetensor_files:
    with safe_open(st_file, framework="pt") as f:
        for key in f.keys():
            if key.startswith("thinker.audio_tower."):
                new_key = key.replace("thinker.audio_tower.", "")
                audio_tower_weights[new_key] = f.get_tensor(key)
audio_tower.load_state_dict(audio_tower_weights, strict=False)
print("Loading MLX language model...")
model, tokenizer = load(MLX_MODEL_PATH)
# Function to process audio and generate response
def understand_audio(audio_path, question):
    """Process audio file and answer questions about it"""
    # Prepare conversation
    conversation = [{"role": "user", "content": [
        {"type": "audio", "audio": audio_path},
        {"type": "text", "text": question}
    ]}]

    # Process inputs
    text_prompt = processor.apply_chat_template(conversation, add_generation_prompt=True, tokenize=False)
    audios, images, videos = process_mm_info(conversation, use_audio_in_video=False)
    inputs = processor(text=text_prompt, audio=audios, images=images, videos=videos,
                       return_tensors="pt", padding=True, use_audio_in_video=False)

    # Process audio through audio_tower
    with torch.no_grad():
        audio_features = inputs['input_features'].squeeze(0)
        feature_lens = inputs['feature_attention_mask'].sum(dim=1)
        audio_outputs = audio_tower(audio_features, feature_lens=feature_lens)
        audio_embeddings = audio_outputs.last_hidden_state

    # Merge text and audio embeddings
    audio_token_id = config.thinker_config.audio_token_id
    input_ids_np = inputs["input_ids"][0].numpy()
    audio_positions = np.where(input_ids_np == audio_token_id)[0]
    embed_layer = model.language_model.model.embed_tokens
    all_embeddings = embed_layer(mx.array(input_ids_np))
    audio_embeddings_mlx = mx.array(audio_embeddings.cpu().numpy())

    # Replace audio tokens with audio embeddings
    segments = []
    last_pos = 0
    for i, pos in enumerate(audio_positions):
        pos = int(pos)
        if pos > last_pos:
            segments.append(all_embeddings[last_pos:pos])
        segments.append(audio_embeddings_mlx[i:i+1])
        last_pos = pos + 1
    if last_pos < all_embeddings.shape[0]:
        segments.append(all_embeddings[last_pos:])
    merged_embeddings = mx.concatenate(segments, axis=0)

    # Generate response
    sampler = make_sampler(temp=0.7, top_p=0.9)
    dummy_prompt = mx.zeros((merged_embeddings.shape[0],), dtype=mx.int32)
    eos_token_id = tokenizer.eos_token_id
    im_end_id = tokenizer.convert_tokens_to_ids("<|im_end|>")
    stop_tokens = {eos_token_id, im_end_id}

    tokens = []
    for (token, _), n in zip(generate_step(prompt=dummy_prompt, model=model,
                                           max_tokens=100, sampler=sampler,
                                           input_embeddings=merged_embeddings),
                             range(100)):
        token_id = token if isinstance(token, int) else token.item()
        tokens.append(token_id)
        if token_id in stop_tokens:
            break

    response = tokenizer.decode(tokens).replace("<|im_end|>", "").replace("<|endoftext|>", "").strip()
    return response

# Models will stay loaded, so you can keep calling this with new audio files and prompts
response = understand_audio("neuro_output.wav", "Summarize this audio clip.")
print(response)
```
## Model Details
| Property | Value |
|---|---|
| Base Model | Qwen3-Omni-30B-A3B-Instruct |
| Quantization | 4-bit (group_size=64, bits=4) |
| Framework | MLX / Apple Silicon |
| Components | Text-only (thinker component) |
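For reference, a quantization with these settings can be produced with mlx-lm's conversion API. This is a sketch based on the table above, not necessarily the exact command used for this checkpoint, and it assumes an mlx-lm build with qwen3-omni-moe support:

```python
# Hypothetical reproduction of the 4-bit conversion (settings taken from
# the table above; not guaranteed to match the author's exact invocation).
from mlx_lm import convert

convert(
    "Qwen/Qwen3-Omni-30B-A3B-Instruct",
    mlx_path="Qwen3-Omni-30B-A3B-Instruct-4bit-mlx",
    quantize=True,
    q_bits=4,
    q_group_size=64,
)
```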
## Implementation Notes
This model is a wrapper around the qwen3_moe architecture that:
- Loads only the text language model weights
- Filters out multimodal components (audio_tower, talker, code2wav, visual); see the sketch after this list
- Supports an `input_embeddings` parameter for hybrid multimodal use cases (used in the audio example above)
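A minimal sketch of the weight-filtering idea, for illustration only; the prefix names are assumptions based on the base checkpoint's key layout, not the actual conversion code:

```python
# Illustrative sketch, not the real conversion script: keep only the
# thinker's text weights and drop the multimodal towers before quantizing.
MULTIMODAL_PREFIXES = (
    "thinker.audio_tower.",  # audio encoder
    "thinker.visual.",       # vision encoder
    "talker.",               # speech generation head
    "code2wav.",             # vocoder
)

def filter_text_weights(weights):
    """Return only the tensors belonging to the text language model."""
    return {
        key: value
        for key, value in weights.items()
        if not key.startswith(MULTIMODAL_PREFIXES)
    }
```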
## License
Apache 2.0 (same as base model)
## Acknowledgments
- Qwen Team for the original model
- MLX team for the framework