Finally, our new paper is out! "FineVision: Open Data Is All You Need" (2510.17269)
If you've ever trained a VLM, you know this problem: nobody shares their data mixtures. It's a black box, which makes replicating SOTA work impossible. We wanted to change that.
FineVision unifies 200 sources into 24 million samples. With 17.3 million images and 9.5 billion answer tokens, it's the largest open resource of its kind.
In the paper, we share how we built it:
- finding and cleaning data at scale
- removing excessive duplicates across sources
- decontaminating against 66 public benchmarks (a quick sketch of the idea follows)
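To make the dedup/decontamination step concrete, here is a minimal, illustrative sketch based on perceptual hashing. It is not the paper's actual pipeline; the hash function, distance threshold, and toy images below are placeholder choices.

from PIL import Image
import imagehash

def phash(img: Image.Image) -> imagehash.ImageHash:
    # Perceptual hash: visually similar images map to nearby hash values.
    return imagehash.phash(img)

# Toy stand-in for the benchmark set (in practice: every image from the eval benchmarks).
benchmark_images = [Image.new("RGB", (64, 64), color=(i, i, i)) for i in (0, 128, 255)]
benchmark_hashes = {phash(img) for img in benchmark_images}

def is_contaminated(candidate: Image.Image, max_distance: int = 5) -> bool:
    # Flag a training image whose hash is within `max_distance` bits of any
    # benchmark image, i.e. a likely duplicate or near-duplicate.
    h = phash(candidate)
    return any(h - b <= max_distance for b in benchmark_hashes)

print(is_contaminated(benchmark_images[0].copy()))  # True: identical to a benchmark image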
My favorite part is Figure 6 (in the video!). It's our visual diversity analysis. It shows that FineVision isn't just bigger; it's more balanced and conceptually richer than other open datasets. NVIDIA's Eagle 2 paper highlighted just how critical this visual diversity is, and our results confirm it: models trained on FineVision consistently outperform those trained on any other open dataset on 11 benchmarks!
To celebrate the paper, I'm also releasing a concatenated and shuffled version of the full dataset: HuggingFaceM4/FineVision_full_shuffled
It's ready to stream, so you can start training your own models right away:
from datasets import load_dataset

d = load_dataset("HuggingFaceM4/FineVision_full_shuffled", split="train", streaming=True)
print(next(iter(d)))
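To sanity-check the stream before wiring it into a training loop, you can peek at a few samples with `take` (the field names depend on the dataset schema):

# Peek at a handful of streamed samples without downloading the whole dataset.
for sample in d.take(3):
    print(sample.keys())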
A big shoutout to the first authors: Luis Wiedmann and Orr Zohar. They are rockstars!
deepseek-ai/DeepSeek-OCR is out! My take:
> pretty insane that it can parse and re-render charts in HTML
> it uses CLIP and SAM features concatenated, so better grounding (see the sketch below)
> very efficient vision-tokens-to-performance ratio
> covers 100 languages
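To illustrate the general idea of concatenating two vision encoders' features before handing them to the LLM (a toy sketch, not DeepSeek-OCR's actual code; all shapes and the projection size are made up):

import torch
import torch.nn as nn

batch, num_tokens = 1, 256
clip_feats = torch.randn(batch, num_tokens, 1024)  # semantic features (CLIP-style encoder)
sam_feats = torch.randn(batch, num_tokens, 256)    # dense/grounding features (SAM-style encoder)

# Concatenate along the channel dimension, then project into the LLM embedding size.
fused = torch.cat([clip_feats, sam_feats], dim=-1)   # (1, 256, 1280)
project = nn.Linear(fused.shape[-1], 2048)
vision_tokens = project(fused)
print(vision_tokens.shape)  # torch.Size([1, 256, 2048])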
IBM just released a small Swiss Army knife for document models: granite-docling-258M on Hugging Face.
> not only a document converter, it can also do document question answering and understands multiple languages
> best part: released under the Apache 2.0 license, so you can use it in your commercial projects
> supports transformers, vLLM, and MLX from the get-go (a minimal transformers sketch follows)
> built on SigLIP2 & granite-165M
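A minimal sketch of trying it with transformers, assuming the checkpoint is hosted as ibm-granite/granite-docling-258M and works with the image-text-to-text pipeline; check the model card for the exact prompt format the model expects.

from transformers import pipeline

pipe = pipeline("image-text-to-text", model="ibm-granite/granite-docling-258M")

messages = [
    {"role": "user", "content": [
        {"type": "image", "url": "https://your-host/document_page.png"},  # placeholder: replace with your document image
        {"type": "text", "text": "Convert this page to structured text."},
    ]},
]
print(pipe(text=messages, max_new_tokens=512))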
First vision language model built off openai/gpt-oss-20b just dropped!
InternVL3.5 comes with 32 models: pre-trained, fine-tuned, and aligned variants in various sizes (OpenGVLab/internvl35-68ac87bd52ebe953485927fb). It uses gpt-oss or Qwen3 for the LLM part.
Many VLMs claim to process hours of video, but can they follow the story? Today, we introduce TimeScope: a benchmark that separates true temporal understanding from marketing hype. Let's see how much VLMs really understand!
We test three skills that matter for real-world use:
- Localized Retrieval: find a specific action (a toy scoring sketch follows the list).
- Information Synthesis: piece together scattered clues.
- Fine-Grained Perception: analyze detailed motion (e.g., count how many times a person swings an axe).
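As a toy illustration of how a localized-retrieval item could be scored (this is not TimeScope's actual harness; the function and values are invented for illustration):

def localized_retrieval_correct(pred_second: float, window: tuple[float, float]) -> bool:
    # Count the prediction as correct if the timestamp the model points to
    # falls inside the annotated time window for the target action.
    start, end = window
    return start <= pred_second <= end

print(localized_retrieval_correct(42.0, (40.0, 45.0)))  # True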
The results are in, and they're revealing. Only Gemini 2.5 Pro handles 1-hour-long videos. Performance drops sharply with duration, showing that long video understanding is still challenging. We've found the breaking points; now the community can start fixing them.
Want to learn more? TimeScope is 100% open-source. Benchmark your model and help us build the next generation of video AI.