NVIDIA Releases 8-Million-Sample Open Dataset and Tooling for OCR, Image Reasoning, and Image and Video QA Tasks
Just two months ago, the NVIDIA Nemotron team released Llama Nemotron VLM Dataset V1 with 3 million samples spanning OCR, visual question answering, and captioning tasks. Today, we are releasing the next version, Nemotron VLM Dataset V2, which adds 8M samples, almost three times as many high-quality samples on top of the previous release. Combined, these two high-quality open VLM post-training datasets comprise 11M samples and were used to train our recently released Nemotron Nano 2 VL.
This time, we focused on three main areas: adding new data modalities such as video, expanding our chain-of-thought reasoning data, and providing the community with a toolchain to generate OCR training data.
We found that for further improvements, our models needed to learn not just the right answer but the thought process for getting there. Adding more targeted chain-of-thought datasets proved to be the key to breaking the plateau on many benchmarks.
We're excited to share this data with the community and we're looking forward to seeing what you build with it.
This release is part of NVIDIA's commitment to open and transparent model development and improvement. By sharing training data, in addition to training tools and final model weights, NVIDIA supports the continued improvement of open source models.
NVIDIA is dedicated to ethical and safe AI, as demonstrated by the comprehensive safety and compliance reviews conducted on this dataset. These reviews included verifying permissive data sources, using open-source or proprietary tools and labeling models, and scanning for CSAM and toxic content.
What's in the Dataset and How Did We Build It?
With this release, we are broadening the dataset's scope to allow for training more capable models. We added:
New Modalities and Domains: We have added a substantial amount of new data covering UI understanding, complex charts, and diagrams. For the first time, we are also including video understanding tasks.
Focus on Reasoning: We were able to break benchmark plateaus by adding more chain-of-thought SFT data, some of which we generated by auto-labeling thinking traces for existing samples. We found that these traces helped especially on samples that the previous model struggled with (see the sketch after this list).
Improved OCR: We further improved on the highly competitive OCR capabilities of our first VL model by adding an even larger variety of training samples, including multilingual data covering ten languages.
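To make the auto-labeling idea concrete, here is a minimal sketch of how an existing QA pair could be augmented with a thinking trace. The prompt wording and the `generate` callable are hypothetical stand-ins for whatever commercially permissive labeling model is used; neither is part of this release.

```python
# Hypothetical sketch: augmenting an existing QA sample with a thinking
# trace. `generate` is a placeholder for the labeling model, not an API
# from this release.

LABELING_PROMPT = (
    "You are given an image, a question, and its known correct answer.\n"
    "Write the step-by-step reasoning that leads to the answer, then the\n"
    "answer itself. Wrap the reasoning in <think>...</think> tags.\n\n"
    "Question: {question}\n"
    "Known answer: {answer}\n"
)

def add_thinking_trace(sample: dict, generate) -> dict:
    """Return a copy of `sample` whose response includes a reasoning trace."""
    prompt = LABELING_PROMPT.format(
        question=sample["question"], answer=sample["answer"]
    )
    traced = generate(image=sample["image"], prompt=prompt)
    # The traced response becomes the new SFT target,
    # e.g. "<think>...</think>\n3".
    return {**sample, "response": traced}
```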
Our OCR training data was generated using a novel LaTeX compilation pipeline that we used to process arXiv papers. Unlike "convert-to-HTML" methods that lose layout and semantics, our pipeline patches the TeX engine to emit precise glyph positions and semantic context, compiling LaTeX directly into richly labeled datasets with PDF page images, 2D bounding boxes, formatted text in Markdown, and semantic classes such as Title, Section-Header, Footnote, and Math. We're open-sourcing the pipeline so you can compile millions of LaTeX documents into training datasets. In addition, you can use our augmentation pipeline to obtain diverse layouts, fonts, and languages, and translate content for multilingual data generation.
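For illustration, one labeled record emitted by such a pipeline could look like the following. The field names here are assumptions chosen for readability; the actual schema is defined by the open-sourced tooling.

```python
# Illustrative shape of one labeled record per rendered PDF page.
# Field names are placeholders, not the pipeline's actual schema.
page_record = {
    "page_image": "2403.01234_page_03.png",      # rendered PDF page image
    "elements": [
        {
            "bbox": [72.0, 96.5, 540.0, 118.2],  # 2D box in page coordinates
            "semantic_class": "Section-Header",  # or Title, Footnote, Math, ...
            "markdown": "## 3. Method",          # formatted text target
        },
        {
            "bbox": [72.0, 130.0, 540.0, 162.7],
            "semantic_class": "Math",
            "markdown": r"$\mathcal{L} = \mathcal{L}_{ce} + \lambda\,\mathcal{L}_{aux}$",
        },
    ],
}
```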
The Nemotron VLM Dataset V2 is composed of 55% image QA samples, 25% OCR samples, 10% video QA samples, and 10% image reasoning samples. Developers can use the dataset in its entirety or curate it further with NVIDIA NeMo Curator to build high-quality datasets for training highly accurate VLMs.
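As a minimal curation sketch, a subset of interest can also be filtered with the Hugging Face `datasets` library rather than NeMo Curator. The repo id and the `category` field below are assumptions; check the dataset card for the actual layout.

```python
from datasets import load_dataset

# Stream the dataset and keep only OCR samples, e.g. to fine-tune a
# document-understanding model. Repo id and `category` field are assumed.
ds = load_dataset("nvidia/Nemotron-VLM-Dataset-v2", split="train", streaming=True)
ocr_only = ds.filter(lambda s: s.get("category") == "ocr")

for sample in ocr_only.take(3):
    print(sample.keys())
```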
We re-annotated popular visual question answering datasets with proprietary and open-source technologies so that the data can be used permissively for training. To extract additional rich information from input images that is not fully captured by the existing QA annotations, we used commercially permissive models and pipelines to generate fine-grained descriptions of the images.
Examples of augmentation include adding chain-of-thought explanations, rule-based QA generation (e.g., with templates, as sketched below), expanding short answers into longer responses, and proper reformatting.
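Here is a small sketch of what template-based, rule-based QA generation might look like, using the chart example from later in this post. The chart-metadata fields are illustrative, not the dataset's actual schema.

```python
# Sketch of rule-based QA generation with templates, one of the
# augmentation styles described above. Metadata fields are illustrative.

TEMPLATES = [
    ("How many legend labels are there?", lambda m: str(len(m["legend_labels"]))),
    ("What is the label or title of the X-axis?", lambda m: m["x_axis_title"]),
]

def generate_qa(chart_meta: dict) -> list[dict]:
    """Expand one chart's metadata into several templated QA pairs."""
    return [{"question": q, "answer": fn(chart_meta)} for q, fn in TEMPLATES]

print(generate_qa({
    "legend_labels": ["Self-employed", "Unpaid family workers", "Wage workers"],
    "x_axis_title": "Year",
}))
# [{'question': 'How many legend labels are there?', 'answer': '3'},
#  {'question': 'What is the label or title of the X-axis?', 'answer': 'Year'}]
```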
| Category | No. of Samples in V1 | No. of Samples Added in V2 |
|---|---|---|
| Image QA | 1,917,755 | 4,466,777 |
| OCR | 814,381 | 2,118,483 |
| Captioning | 131,718 | N/A |
| Video QA | N/A | 829,201 |
| Image Reasoning | N/A | 733,138 |
| Total | 2,863,854 | 8,147,599 |
The Nemotron VLM training dataset was purpose-built to support enterprise use cases and production-grade AI development. It is ready for commercial use.
Data Examples
Example 1: Video Understanding
Dataset: ego_exo_learn, Single frame from video: beead82c-ac78-11ee-819f-80615f12b59e.mp4. Shared under MIT license
Prompt: From the camera carrier's perspective, narrate the movements and interactions involving their hands.
Response: The left hand injects the solution into the small test tube filled with white powder.
Example 2: Chart Reasoning with Chain of Thought
Dataset: plotqa_cot, 8106.png. Shared under CC-BY-4.0 license
Prompt: How many legend labels are there ? Just give me the answer, no need for the process.
Response: <think>Got it, let's look at the legend on the right side of the image. The legend has three labels: "Self-employed", "Unpaid family workers", and "Wage workers". So we need to count how many labels there are. Let's count them: first one is Self-employed, second is Unpaid family workers, third is Wage workers. So that's three labels.</think>
3
Prompt: What is the label or title of the X-axis ? Directly provide the answer.
Response: <think>Got it, let's look at the graph. The X-axis has labels like "2009" and "2012", and the title of the X-axis is "Year". Wait, let me check. The X-axis label is "Year" because that's what's written below the axis. So the title of the X-axis is "Year".</think>
Year
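As the example shows, reasoning samples wrap the trace in `<think>...</think>` tags followed by the final answer. A small helper (a sketch, not part of the dataset tooling) can split the two for evaluation or filtering:

```python
import re

def split_response(response: str) -> tuple[str, str]:
    """Split a response into (thinking trace, final answer)."""
    match = re.match(r"<think>(.*?)</think>\s*(.*)", response, re.DOTALL)
    if match is None:                # answer-only samples carry no trace
        return "", response.strip()
    thought, answer = match.groups()
    return thought.strip(), answer.strip()

thought, answer = split_response("<think>Count the legend labels: three.</think>\n3")
print(answer)  # -> 3
```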
How to Get It
👉 Explore the dataset here on Hugging Face to use it for your projects!
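If you prefer to work with the files directly, a quick way to pull everything locally is `huggingface_hub.snapshot_download`. The repo id below is an assumption, so check the Hugging Face page linked above for the exact name.

```python
from huggingface_hub import snapshot_download

# Download the full dataset repository to the local cache.
local_dir = snapshot_download(
    repo_id="nvidia/Nemotron-VLM-Dataset-v2",  # assumed repo id
    repo_type="dataset",
)
print(local_dir)
```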


