# Moondream 3 HF
Moondream 3 HF is a reimplementation of the Moondream 3 (Preview) model using the standard Hugging Face Transformers architecture conventions.
## Overview
- Multimodal vision-language model with a mixture-of-experts (MoE) text backbone
- Architecture and weights correspond to Moondream 3 (Preview) (approximately 9B parameters, 2B active)
- Implemented as standard Transformers components:
  `Moondream3ForConditionalGeneration`, `Moondream3Model`, `Moondream3TextModel`, `Moondream3VisionModel`, `Moondream3Processor`, `Moondream3ImageProcessor`, `Moondream3Config`
The purpose of this repository is to make Moondream 3 interoperable with the Hugging Face ecosystem so it can be used directly with the Transformers API, including generate(), Trainer, and PEFT integrations.
## Example usage
Example of running multimodal inference with the moondream3-hf implementation:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

DEVICE = "cuda:0"

model = AutoModelForCausalLM.from_pretrained(
    "NyxKrage/moondream3-hf", dtype="bfloat16", device_map=DEVICE, trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(
    "NyxKrage/moondream3-hf", use_fast=False, trust_remote_code=True
)

image1 = Image.open("image1.jpg")
image2 = Image.open("image2.jpg")

# No prefix is provided here, so the chat template defaults to query mode
# (see "Prompting modes" below).
text = [processor.apply_chat_template("", tokenize=False)] * 2

inputs = processor(text=text, images=[image1, image2])
inputs = {k: v.to(DEVICE) for k, v in inputs.items()}

model.eval()
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        use_cache=True,
    )

# Strip the prompt tokens so only newly generated tokens are decoded.
outputs = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs["input_ids"], outputs)
]
for output in outputs:
    print(processor.decode(output))
```
The `chat_template` uses Hugging Face's Jinja format and accepts either a single string or a sequence of messages (user [, assistant]).
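As a minimal sketch, both input forms look like this; the role/content message schema below is an assumption based on the standard Transformers chat convention, so check the bundled template if it differs:

```python
# Single string, prefixed with a task mode (see "Prompting modes" below).
prompt_from_string = processor.apply_chat_template(
    "query: What is happening in this image?", tokenize=False
)

# Sequence of messages: a user turn, optionally followed by an assistant turn.
# The role/content keys are assumed, not confirmed by this repository.
messages = [
    {"role": "user", "content": "query: What is happening in this image?"},
    # {"role": "assistant", "content": "A dog is catching a frisbee in a park."},
]
prompt_from_messages = processor.apply_chat_template(messages, tokenize=False)
```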
## Training
The model can be trained using `trl` and supports `peft` and `bitsandbytes` out of the box.
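Below is a hedged, text-only QLoRA-style sketch combining `trl`, `peft`, and `bitsandbytes`. The toy dataset, LoRA hyperparameters, `target_modules` choice, and the `processor.tokenizer` attribute access are illustrative assumptions, not values shipped with this repository:

```python
import torch
from datasets import Dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoProcessor, BitsAndBytesConfig
from trl import SFTConfig, SFTTrainer

# 4-bit quantization via bitsandbytes (QLoRA-style setup).
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "NyxKrage/moondream3-hf",
    quantization_config=bnb_config,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "NyxKrage/moondream3-hf", use_fast=False, trust_remote_code=True
)

# Toy text-only dataset built from the chat template; replace with real data.
train_dataset = Dataset.from_dict(
    {"text": [processor.apply_chat_template("query: Describe the scene.", tokenize=False)]}
)

# LoRA adapter config; r, alpha, and target_modules are illustrative assumptions.
peft_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules="all-linear",
    task_type="CAUSAL_LM",
)

trainer = SFTTrainer(
    model=model,
    train_dataset=train_dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="moondream3-lora", bf16=True),
    processing_class=processor.tokenizer,  # assumes the processor exposes its tokenizer
)
trainer.train()
```

Image-conditioned training would additionally route images through the processor, as in the inference example above.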
The repository also includes a variant that replaces the MoE layers with a grouped_gemm implementation, adapted from github:woct0rdho/transformers-qwen3-moe-fused. To use it, import `Moondream3ForConditionalGeneration` from `modeling_moondream3_fusedmoe.py` instead.
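For example (a sketch that assumes the repository files have been downloaded locally, e.g. via `huggingface_hub.snapshot_download`, so the module is importable):

```python
# Drop-in replacement for the standard class, backed by the fused grouped_gemm MoE layers.
from modeling_moondream3_fusedmoe import Moondream3ForConditionalGeneration

model = Moondream3ForConditionalGeneration.from_pretrained(
    "NyxKrage/moondream3-hf", dtype="bfloat16", device_map="cuda:0"
)
```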
## Prompting modes
The chat template supports multiple task types via text prefixes:
| Mode | Template prefix | Example input |
|---|---|---|
| Query | `query:` | `query: What is happening in this image?` |
| Caption | `caption: [short/normal/long]` | `caption: long` |
| Detect | `detect:` | `detect: dog` |
| Point | `point:` | `point: red car` |
If no prefix is provided, the default mode is `query:`.
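For illustration, prompts for each mode can be built the same way as in the inference example above (reusing that example's `processor` and `image1`):

```python
# One illustrative prompt per mode; each string starts with a prefix from the table above.
prompts = [
    "query: What is happening in this image?",
    "caption: long",
    "detect: dog",
    "point: red car",
]
texts = [processor.apply_chat_template(p, tokenize=False) for p in prompts]
inputs = processor(text=texts, images=[image1] * len(prompts))
```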
This reimplementation aims to provide interoperability and ease of experimentation within the Hugging Face ecosystem. It is not an official release.
## License
The model weights remain under the Business Source License 1.1 with an Additional Use Grant (No Third-Party Service), identical to the original Moondream 3 Preview license.
This allows research, personal, and most commercial use, but prohibits offering hosted or resold access that competes with M87 Labs’ paid services.
All new implementation code in this repository is released under the Apache 2.0 License.
## Credits
- Original model and research: M87 Labs / Moondream AI
- Hugging Face–compatible reimplementation: NyxKrage
- Based on the public Moondream 3 Preview release