---
license: apache-2.0
datasets:
- ds4sd/SynthCodeNet
- ds4sd/SynthFormulaNet
- ds4sd/SynthChartNet
- HuggingFaceM4/DoclingMatix
tags:
- text-generation
- documents
- code
- formula
- chart
- ocr
- layout
- table
- document-parse
- docling
- granite
- extraction
- math
---

# granite-docling-258m

**Model Summary**: Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with [DoclingDocuments](https://docling-project.github.io/docling/) to ensure full compatibility. Granite Docling 258M builds upon the IDEFICS3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM.

- **Developed by**: IBM Research
- **Model type**: Multi-modal model (image+text-to-text)
- **Language(s)**: English (NLP)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Release Date**: September 17, 2025

Granite-docling-258M is fully integrated into the Docling pipelines, carrying over existing [features](https://huggingface.co/ds4sd/SmolDocling-256M-preview) while introducing a number of powerful new features, including:

- 🔢 **Enhanced Equation Recognition**: More accurate detection and formatting of mathematical formulas
- 🧩 **Flexible Inference Modes**: Choose between full-page inference and bbox-guided region inference
- 🧘 **Improved Stability**: More robust against getting stuck in infinite generation loops
- 🧮 **Enhanced Inline Equations**: Better inline math recognition
- 🧾 **Document Element QA**: Answer questions about a document's structure, such as the presence and order of document elements
- 🌍 **Japanese, Arabic and Chinese support** (_experimental_)

## Evaluations
| | smoldocling-256m-preview | granite-docling-258m |
|---|---|---|
| **Layout** | | |
| MAP ↑ | 0.21 | 0.28 |
| F1 ↑ | 0.79 | 0.85 |
| Precision ↑ | 0.86 | 0.87 |
| Recall ↑ | 0.82 | 0.89 |
| **Full Page OCR** | | |
| Edit-distance ↓ | 0.48 (0.46) | 0.46 (0.44) |
| F1 ↑ | 0.80 (0.76) | 0.75 (0.78) |
| Precision ↑ | 0.89 (0.85) | 0.81 (0.85) |
| Recall ↑ | 0.79 (0.74) | 0.73 (0.77) |
| BLEU ↑ | 0.58 (0.54) | 0.56 (0.59) |
| Meteor ↑ | 0.67 (0.67) | 0.67 (0.70) |
| **Code Recognition** | | |
| Edit-distance ↓ | 0.114 | 0.013 |
| F1 ↑ | 0.915 | 0.988 |
| Precision ↑ | 0.94 | 0.99 |
| Recall ↑ | 0.909 | 0.988 |
| BLEU ↑ | 0.875 | 0.983 |
| Meteor ↑ | 0.889 | 0.986 |
| **Equation Recognition** | | |
| Edit-distance ↓ | 0.119 | 0.073 |
| F1 ↑ | 0.947 | 0.968 |
| Precision ↑ | 0.959 | 0.968 |
| Recall ↑ | 0.941 | 0.969 |
| BLEU ↑ | 0.824 | 0.893 |
| Meteor ↑ | 0.878 | 0.927 |
| **Table Recognition (FinTabNet 150dpi)** | | |
| TEDS (structure) ↑ | 0.82 | 0.97 |
| TEDS (w/content) ↑ | 0.76 | 0.96 |
| **Other Benchmarks** | | |
| MMStar ↑ | 0.17 | 0.30 |
| OCRBench ↑ | 338 | 500 |
## Getting started

You can use **transformers**, **vllm**, or **onnx** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (md, html, etc.):
**📄 Single page image inference using Transformers 🤖**

```python
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)

# Create a DoclingDocument
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")

# Export as any format
# HTML
# Path("Out/").mkdir(parents=True, exist_ok=True)
# output_path_html = Path("Out/") / "example.html"
# doc.save_as_html(output_path_html)

# MD
print(doc.export_to_markdown())
```
**🚀 Fast Batch Inference Using VLLM**

```python
# Prerequisites:
# pip install vllm
# pip install docling_core
# place page images you want to convert into "img/" dir

import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from pathlib import Path

# Configuration
MODEL_PATH = "ibm-granite/granite-docling-258M"
IMAGE_DIR = "img/"  # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192
)

# Load and prepare all images and prompts up front
batched_inputs = []
image_names = []
for img_file in sorted(os.listdir(IMAGE_DIR)):
    if img_file.lower().endswith((".png", ".jpg", ".jpeg")):
        img_path = os.path.join(IMAGE_DIR, img_file)
        with Image.open(img_path) as im:
            image = im.convert("RGB")
        prompt = (
            f"<|start_of_role|>user<|end_of_role|>{PROMPT_TEXT}<|end_of_text|>\n"
            f"<|start_of_role|>assistant<|end_of_role|>"
        )
        batched_inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
        image_names.append(os.path.splitext(img_file)[0])

# Run batch inference
start_time = time.time()
outputs = llm.generate(batched_inputs, sampling_params=sampling_params)

# Postprocess all results
for img_fn, output, input_data in zip(image_names, outputs, batched_inputs):
    doctags = output.outputs[0].text
    output_path_dt = Path(OUTPUT_DIR) / f"{img_fn}.dt"
    output_path_md = Path(OUTPUT_DIR) / f"{img_fn}.md"

    # Save DocTags output
    with open(output_path_dt, "w", encoding="utf-8") as f:
        f.write(doctags)

    # Convert to DoclingDocument and save markdown
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [input_data["multi_modal_data"]["image"]])
    doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
    doc.save_as_markdown(output_path_md)

print(f"Total time: {time.time() - start_time:.2f} sec")
```
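You can also let Docling drive granite-docling end to end and handle the export for you. The sketch below assumes a recent Docling release that ships the VLM pipeline (`VlmPipeline`, `VlmPipelineOptions`); class names and defaults may differ between Docling versions, so treat it as a starting point rather than the canonical integration:

```python
# A minimal sketch of running granite-docling through Docling's VLM pipeline.
# Assumes a Docling version that provides VlmPipeline/VlmPipelineOptions;
# exact class names and defaults may vary between releases.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Default VLM pipeline options; point them at granite-docling explicitly
# if your Docling version does not already use it by default.
pipeline_options = VlmPipelineOptions()

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# Convert a PDF and export the resulting DoclingDocument as Markdown.
result = converter.convert("document.pdf")  # hypothetical input path
print(result.document.export_to_markdown())
```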
**💻 Local inference on Apple Silicon with MLX:** [see here](https://huggingface.co/ibm-granite/granite-docling-258M-mlx)

## Supported Instructions
| Description | Instruction | Short Instruction |
|---|---|---|
| Full conversion | Convert this page to docling. | - |
| Chart | Convert chart to table. | `<chart>` |
| Formula | Convert formula to LaTeX. | `<formula>` |
| Code | Convert code to text. | `<code>` |
| Table | Convert table to OTSL. (Lysak et al., 2023) | `<otsl>` |
| Actions and Pipelines | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` | - |
| | Identify element at: `<loc_247><loc_482><loc_252><loc_486>` | - |
| | Find all 'text' elements on the page, retrieve all section headers. | - |
| | Detect footer elements on the page. | - |
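As an example of a targeted instruction, the sketch below reuses the Transformers setup from the Getting started section but swaps the prompt for the chart instruction from the table above. The image URL is only a placeholder; in practice you would pass a page (or cropped region) that actually contains a chart:

```python
# Minimal sketch: same Transformers workflow as above, with a targeted
# instruction ("Convert chart to table.") instead of full-page conversion.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M", torch_dtype=torch.bfloat16
).to(DEVICE)

# Placeholder image; use a page or crop that contains the element you target.
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert chart to table."},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=4096)
doctags = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
)[0].lstrip()
print(doctags)
```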
## Model Architecture

The architecture of granite-docling-258m consists of the following components:

1. Vision encoder: [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512)
2. Vision-language connector: pixel shuffle projector (as in Idefics3)
3. Large language model: Granite 165M

We built upon [Idefics3](https://huggingface.co/docs/transformers/en/model_doc/idefics3) to train our model. We incorporated DocTags into our LLM's supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling. The model was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework, which provides a lightweight and efficient training setup for vision-language models.

## Training Data

Our training corpus consists of two principal sources: (1) publicly available datasets and (2) internally constructed synthetic datasets designed to elicit specific document understanding capabilities. In particular, we incorporate:

* [**SynthCodeNet**](https://huggingface.co/datasets/ds4sd/SynthCodeNet): a large-scale collection of synthetically rendered code snippets spanning over 50 programming languages
* [**SynthFormulaNet**](https://huggingface.co/datasets/ds4sd/SynthFormulaNet): a dataset of synthetic mathematical expressions paired with ground-truth LaTeX representations
* [**SynthChartNet**](https://huggingface.co/datasets/ds4sd/SynthChartNet): synthetic chart images annotated with structured table outputs
* [**DoclingMatix**](https://huggingface.co/datasets/HuggingFaceM4/DoclingMatix): a curated corpus of real-world document pages sampled from diverse domains

## Infrastructure

We trained granite-docling-258m on Blue Vela, IBM's supercomputing cluster outfitted with NVIDIA H100 GPUs, which provides a scalable and efficient infrastructure for training our models across thousands of GPUs.

## Resources

- ⭐️ Learn about the latest updates with Docling: https://docling-project.github.io/docling/#features
- 🚀 Get started with Docling concepts, integrations and tutorials: https://docling-project.github.io/docling/getting_started/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
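As a quick, optional sanity check on the components listed under Model Architecture, you can inspect the checkpoint's configuration with Transformers. This is only an illustrative sketch; the field names assume the Idefics3-style config layout the model builds upon:

```python
# Illustrative sketch: inspect the multimodal config to see the vision encoder
# and language model settings described in the Model Architecture section.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ibm-granite/granite-docling-258M")

print(type(config).__name__)   # Idefics3-style multimodal configuration
print(config.vision_config)    # vision encoder settings (SigLIP2-base, 512 px input)
print(config.text_config)      # language model settings (Granite 165M)
```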