Xiaomi-MiMo-VL-Miloco


MiMo-VL-Miloco-7B-GGUF

This repository provides a quantized version of MiMo-VL-Miloco-7B. It includes two components:

  • Language model (LLM): Q4_0
  • Visual encoder (mmproj): BF16

These files are built with llama.cpp, support inference on multiple platforms, and are intended for resource-constrained scenarios.
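A minimal usage sketch in Python, assuming the file names below (check the repository's file list for the exact ones) and that llama.cpp's multimodal CLI, llama-mtmd-cli, is built and on your PATH; the flags are the ones llama.cpp's multimodal tools commonly use, but verify them against your build:

# Download the quantized weights and the visual projector, then run llama.cpp's
# multimodal CLI. File names and CLI flags below are illustrative assumptions.
from huggingface_hub import hf_hub_download
import subprocess

repo_id = "xiaomi-open-source/Xiaomi-MiMo-VL-Miloco-7B-GGUF"

# Hypothetical file names -- check the repository for the exact ones.
llm_path = hf_hub_download(repo_id, "MiMo-VL-Miloco-7B-Q4_0.gguf")
mmproj_path = hf_hub_download(repo_id, "mmproj-MiMo-VL-Miloco-7B-BF16.gguf")

# llama-mtmd-cli is llama.cpp's multimodal front end; verify the flag names
# against your llama.cpp build before relying on them.
subprocess.run([
    "llama-mtmd-cli",
    "-m", llm_path,
    "--mmproj", mmproj_path,
    "--image", "living_room.jpg",
    "-p", "What activity is happening in this scene?",
], check=True)

Note that the mmproj file must be supplied alongside the LLM weights whenever image or video input is used.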

Introduction

Welcome to Xiaomi MiMo-VL-Miloco — the first open-source multimodal model built to actually understand what’s happening at home!

🤗 Why you’ll love it:

  • Built on MiMo-VL-7B: a rock-solid vision–language backbone with reliable video understanding and instruction-following.
  • Home-savvy by design: it spots everyday activities (esports, workouts, watching TV, reading, and more) and reads common hand gestures like the V sign, thumbs-up, open palm, OK, and even the shaka hand sign.
  • Base skills intact: through a mixed training strategy of SFT and RL, we boost home-scene smarts while keeping the model’s generality and transferability in great shape.

🌟 Training recipe:

We use a carefully tuned two-stage pipeline to nail home-scene skills without sacrificing general abilities.

Stage 1: Supervised Fine-Tuning (SFT)

This stage focuses on boosting the model’s core capabilities in home scenarios. Even with a limited training set, we strike a good balance between sample-efficient learning and fast inference:

  • Chain-of-thought supervision: we add chain-of-thought reasoning traces so the model learns structured knowledge about home scenarios.
  • Token-budget-aware reasoning: training with “budgeted” reasoning encourages concise, straight-to-the-point answers at inference (see the illustrative sketch after this list).
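
The exact training format is not published; the sketch below is only an illustration of the idea, and the build_sft_sample helper, budget value, prompt wording, and field names are all assumptions rather than the actual MiMo-VL-Miloco data format:

# Illustrative only: one way to phrase "budgeted" chain-of-thought supervision.
def build_sft_sample(question: str, reasoning: str, answer: str, budget_tokens: int = 128) -> dict:
    # Ask the model to reason within an explicit token budget, then answer.
    instruction = (
        f"{question}\n"
        f"Think step by step, but keep your reasoning under {budget_tokens} tokens, "
        f"then give the final answer."
    )
    target = f"<think>{reasoning}</think>\n{answer}"
    return {"prompt": instruction, "response": target}

sample = build_sft_sample(
    question="What is the person in the video doing?",
    reasoning="The person holds a controller and watches a monitor showing a game.",
    answer="They are playing a video game (esports).",
)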

Stage 2: Reinforcement Learning (RL)

Building on fine-tuning, this stage introduces GRPO-based reinforcement learning to enhance the model’s overall performance (a sketch of the group-relative advantage follows the list below):

  • Efficient training data: we apply the Time-R1 data strategy (our work, accepted at NeurIPS 2025) to build efficient training datasets across multiple domains.
  • Keep-it-general: specialize for home tasks while preserving broad understanding and language generation.
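
GRPO’s core idea is a group-relative advantage: several responses are sampled for the same prompt and each response’s reward is normalized against the group, removing the need for a separate critic model. A minimal sketch of that computation (the reward values and function name are illustrative, not the actual training code):

from statistics import mean, stdev

def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    # GRPO scores each sampled response relative to the other responses drawn
    # for the same prompt: subtract the group mean and divide by the group
    # standard deviation.
    mu = mean(rewards)
    sigma = stdev(rewards) if len(rewards) > 1 else 0.0
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: rewards for four sampled answers to the same home-scene question.
print(group_relative_advantages([1.0, 0.0, 0.5, 1.0]))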

In short: Xiaomi MiMo-VL-Miloco is your friendly, sharp-eyed model roommate—great at recognizing what’s going on around the house, and still ready for the wider world.

Performance

Evaluation of Home-Scenario Understanding Capabilities (F1-Score)

  • MiMo-VL-Miloco-7B achieves leading performance in both gesture recognition and common household scene understanding.
(Figure: Accuracy & Recall)

Results of general capability evaluations

In household scene understanding, we prioritize video and image perception alongside the model’s reasoning ability.

  • Across three video benchmarks (Video-MME, Video-MMMU, Charades-STA), MiMo-VL-Miloco-7B shows clear improvements over the base model.
  • On MMMU-Pro, a general-capability benchmark, it also improves substantially over the base model (by more than 10%).
  • Surprisingly, as video and image understanding improved, we observed corresponding gains on the text-only benchmark MMLU-Pro.
  • We see a modest performance dip on tasks such as document understanding, OCR, and mathematics; this is in line with expectations and does not affect the model’s intended use cases.
(Figure: Accuracy & Recall)

Citation

@misc{xiaomimimovlmiloco,
  author       = {Jiaze Li and Yuxun Qu and Jingyang Chen and Shijie Xu and Zhenru Lin and Junyou Zhu and Boshen Xu and Wenhui Tan and Pei Fu and JianZhong Ju and Zhenbo Luo and Jian Luan},
  title        = {Xiaomi MiMo-VL-Miloco},
  year         = {2025},
  howpublished = {\url{https://github.com/XiaoMi/xiaomi-mimo-vl-miloco}},
}

Contact

Please contact us at [email protected] or open an issue if you have any questions.
