---
license: apache-2.0
datasets:
- ds4sd/SynthCodeNet
- ds4sd/SynthFormulaNet
- ds4sd/SynthChartNet
- HuggingFaceM4/DoclingMatix
tags:
- text-generation
- documents
- code
- formula
- chart
- ocr
- layout
- table
- document-parse
- docling
- granite
- extraction
- math
---
# granite-docling-258m
**Model Summary**: Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. It preserves the core features of Docling and integrates seamlessly with [DoclingDocuments](https://docling-project.github.io/docling/) to ensure full compatibility.
Granite Docling 258M builds on the IDEFICS3 architecture with two key modifications: the vision encoder is replaced with siglip2-base-patch16-512 and the language model is replaced with a Granite 165M LLM.
- **Developed by**: IBM Research
- **Model type**: Multi-modal model (image+text-to-text)
- **Language(s)**: English (NLP)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Release Date**: September 17, 2025
Granite-docling-258M is fully integrated into the Docling pipelines, carrying over the existing [features of SmolDocling](https://huggingface.co/ds4sd/SmolDocling-256M-preview) while introducing several powerful new capabilities, including:
- Enhanced Equation Recognition: More accurate detection and formatting of mathematical formulas
- Flexible Inference Modes: Choose between full-page inference and bbox-guided region inference (see the prompt sketch below)
- Improved Stability: Less prone to infinite generation loops
- Enhanced Inline Equations: Better inline math recognition
- Document Element QA: Answer questions about a document's structure, such as the presence and order of document elements
- Japanese, Arabic and Chinese support (_experimental_)
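
The snippet below is a minimal sketch of how these inference modes map to prompts, using the chat-message format from the Getting started section further down. The region coordinates are the illustrative values from the Supported Instructions table and would normally come from your own layout detection.

```python
# Sketch only: prompt variants for the inference modes listed above.
# These message dicts plug into processor.apply_chat_template(...) exactly as
# in the Transformers example below; the <loc_*> coordinates are illustrative.

def make_messages(instruction: str) -> list:
    return [
        {
            "role": "user",
            "content": [
                {"type": "image"},
                {"type": "text", "text": instruction},
            ],
        }
    ]

full_page = make_messages("Convert this page to docling.")
region_ocr = make_messages(
    "OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>"
)
element_qa = make_messages(
    "Find all 'text' elements on the page, retrieve all section headers."
)
```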
## Evaluations
|  | smoldocling-256m-preview | granite-docling-258m |
| --- | --- | --- |
| **Layout** | | |
| MAP ↑ | 0.21 | 0.28 |
| F1 ↑ | 0.79 | 0.85 |
| Precision ↑ | 0.86 | 0.87 |
| Recall ↑ | 0.82 | 0.89 |
| **Full Page OCR** | | |
| Edit-distance ↓ | 0.48 (0.46) | 0.46 (0.44) |
| F1 ↑ | 0.80 (0.76) | 0.75 (0.78) |
| Precision ↑ | 0.89 (0.85) | 0.81 (0.85) |
| Recall ↑ | 0.79 (0.74) | 0.73 (0.77) |
| BLEU ↑ | 0.58 (0.54) | 0.56 (0.59) |
| Meteor ↑ | 0.67 (0.67) | 0.67 (0.70) |
| **Code Recognition** | | |
| Edit-distance ↓ | 0.114 | 0.013 |
| F1 ↑ | 0.915 | 0.988 |
| Precision ↑ | 0.94 | 0.99 |
| Recall ↑ | 0.909 | 0.988 |
| BLEU ↑ | 0.875 | 0.983 |
| Meteor ↑ | 0.889 | 0.986 |
| **Equation Recognition** | | |
| Edit-distance ↓ | 0.119 | 0.073 |
| F1 ↑ | 0.947 | 0.968 |
| Precision ↑ | 0.959 | 0.968 |
| Recall ↑ | 0.941 | 0.969 |
| BLEU ↑ | 0.824 | 0.893 |
| Meteor ↑ | 0.878 | 0.927 |
| **Table Recognition (FinTabNet 150dpi)** | | |
| TEDS (structure) ↑ | 0.82 | 0.97 |
| TEDS (w/content) ↑ | 0.76 | 0.96 |
| **Other Benchmarks** | | |
| MMStar ↑ | 0.17 | 0.30 |
| OCRBench ↑ | 338 | 500 |
## Getting started
You can use **transformers**, **vllm**, or **onnx** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results into a variety of output formats (Markdown, HTML, etc.):
Single-page image inference using Transformers
```python
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers
import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
# Load images
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")
# Initialize processor and model
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
"ibm-granite/granite-docling-258M",
torch_dtype=torch.bfloat16,
_attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)
# Create input messages
messages = [
{
"role": "user",
"content": [
{"type": "image"},
{"type": "text", "text": "Convert this page to docling."}
]
},
]
# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)
# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
trimmed_generated_ids,
skip_special_tokens=False,
)[0].lstrip()
# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)
# create a docling document
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
# export as any format
# HTML
# Path("Out/").mkdir(parents=True, exist_ok=True)
# output_path_html = Path("Out/") / "example.html"
# doc.save_as_html(output_path_html)
# MD
print(doc.export_to_markdown())
```
Fast batch inference using vLLM
```python
# Prerequisites:
# pip install vllm
# pip install docling_core
# place page images you want to convert into "img/" dir
import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from pathlib import Path
# Configuration
MODEL_PATH = "ibm-granite/granite-docling-258M"
IMAGE_DIR = "img/" # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to docling."
# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)
# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})
sampling_params = SamplingParams(
temperature=0.0,
max_tokens=8192
)
# Load and prepare all images and prompts up front
batched_inputs = []
image_names = []
for img_file in sorted(os.listdir(IMAGE_DIR)):
if img_file.lower().endswith((".png", ".jpg", ".jpeg")):
img_path = os.path.join(IMAGE_DIR, img_file)
with Image.open(img_path) as im:
image = im.convert("RGB")
prompt = (
f"<|start_of_role|>user<|end_of_role|>{PROMPT_TEXT}<|end_of_text|>\n"
f"<|start_of_role|>assistant<|end_of_role|>"
)
batched_inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
image_names.append(os.path.splitext(img_file)[0])
# Run batch inference
start_time = time.time()
outputs = llm.generate(batched_inputs, sampling_params=sampling_params)
# Postprocess all results
for img_fn, output, input_data in zip(image_names, outputs, batched_inputs):
doctags = output.outputs[0].text
output_path_dt = Path(OUTPUT_DIR) / f"{img_fn}.dt"
output_path_md = Path(OUTPUT_DIR) / f"{img_fn}.md"
with open(output_path_dt, "w", encoding="utf-8") as f:
f.write(doctags)
# Convert to DoclingDocument and save markdown
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [input_data["multi_modal_data"]["image"]])
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
doc.save_as_markdown(output_path_md)
print(f"Total time: {time.time() - start_time:.2f} sec")
```
Local inference on Apple Silicon with MLX: [see here](https://huggingface.co/ibm-granite/granite-docling-258M-mlx)
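
Beyond decoding DocTags yourself, recent Docling releases can drive granite-docling end to end through their VLM pipeline. The sketch below assumes a current `docling` installation and that the model spec is exposed as `vlm_model_specs.GRANITEDOCLING_TRANSFORMERS`; check the Docling documentation for the exact names in your release.

```python
# Sketch only: letting Docling run granite-docling as its VLM pipeline.
# Assumes a recent `pip install docling`; the GRANITEDOCLING_TRANSFORMERS spec
# name is an assumption, consult the Docling docs if your version differs.
from docling.datamodel import vlm_model_specs
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

pipeline_options = VlmPipelineOptions(
    vlm_options=vlm_model_specs.GRANITEDOCLING_TRANSFORMERS,
)
converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)
result = converter.convert("my_document.pdf")  # hypothetical input file
print(result.document.export_to_markdown())
```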
## Supported Instructions
| Description | Instruction | Short Instruction |
| --- | --- | --- |
| Full conversion | Convert this page to docling. | - |
| Chart | Convert chart to table. | `<chart>` |
| Formula | Convert formula to LaTeX. | `<formula>` |
| Code | Convert code to text. | `<code>` |
| Table | Convert table to OTSL. (Lysak et al., 2023) | `<otsl>` |
| Actions and Pipelines | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` | - |
| | Identify element at: `<loc_247><loc_482><loc_252><loc_486>` | - |
| | Find all 'text' elements on the page, retrieve all section headers. | - |
| | Detect footer elements on the page. | - |
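
Any instruction from the table above can be dropped into the prompts shown in the Getting started section. As a hedged sketch, the vLLM batch example could target a cropped formula image simply by swapping the prompt text; the image path below is hypothetical and should point at a cropped formula region.

```python
# Sketch only: using a task-specific instruction from the table above with the
# same vLLM prompt template as in the batch example; the image path is
# hypothetical.
from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(model="ibm-granite/granite-docling-258M", limit_mm_per_prompt={"image": 1})
sampling_params = SamplingParams(temperature=0.0, max_tokens=1024)

prompt = (
    "<|start_of_role|>user<|end_of_role|>Convert formula to LaTeX.<|end_of_text|>\n"
    "<|start_of_role|>assistant<|end_of_role|>"
)
image = Image.open("formula_crop.png").convert("RGB")  # hypothetical crop

outputs = llm.generate(
    [{"prompt": prompt, "multi_modal_data": {"image": image}}],
    sampling_params=sampling_params,
)
print(outputs[0].outputs[0].text)  # LaTeX for the cropped formula
```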
## Model Architecture
The architecture of granite-docling-258m consists of the following components:
1. Vision encoder: [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512)
2. Vision-language connector: pixel shuffle projector (as in Idefics3)
3. Large language model: Granite 165M
We built upon [Idefics3](https://huggingface.co/docs/transformers/en/model_doc/idefics3) to train our model. We incorporated DocTags into our LLMโs supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling.
The model was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework, which provides a lightweight and efficient training setup for vision-language models.
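
To see these components reflected in the released checkpoint, you can inspect its configuration with transformers. This is a minimal sketch assuming the checkpoint follows the Idefics3-style config layout with `vision_config` and `text_config` sub-configs.

```python
# Sketch only: inspecting the three components listed above via the model
# config. Assumes an Idefics3-style layout with vision_config / text_config.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ibm-granite/granite-docling-258M")
print(type(config).__name__)    # top-level (Idefics3-style) config class
print(config.vision_config)     # SigLIP2 vision encoder hyperparameters
print(config.text_config)       # Granite 165M language model hyperparameters
```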
## Training Data
Our training corpus consists of two principal sources: (1) publicly available datasets and (2) internally constructed synthetic datasets designed to elicit specific document understanding capabilities.
In particular, we incorporate:
* [**SynthCodeNet**](https://huggingface.co/datasets/ds4sd/SynthCodeNet) โ a large-scale collection of synthetically rendered code snippets spanning over 50 programming languages
* [**SynthFormulaNet**](https://huggingface.co/datasets/ds4sd/SynthFormulaNet) โ a dataset of synthetic mathematical expressions paired with ground-truth LaTeX representations
* [**SynthChartNet**](https://huggingface.co/datasets/ds4sd/SynthChartNet) โ synthetic chart images annotated with structured table outputs
* [**DoclingMatix**](https://huggingface.co/datasets/HuggingFaceM4/DoclingMatix) โ a curated corpus of real-world document pages sampled from diverse domains
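
The public synthetic datasets above are hosted on the Hugging Face Hub. As a rough sketch, assuming a default configuration with a `train` split, they can be streamed with the `datasets` library:

```python
# Sketch only: streaming a few samples from one of the public training sets.
# Assumes the dataset exposes a default configuration with a "train" split.
from datasets import load_dataset

ds = load_dataset("ds4sd/SynthCodeNet", split="train", streaming=True)
for i, sample in enumerate(ds):
    print(sample.keys())  # inspect available fields before downloading in full
    if i >= 2:
        break
```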
## Infrastructure
We train granite-docling-258m using Blue Vela, IBM's supercomputing cluster outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models across thousands of GPUs.
## Resources
- โญ๏ธ Learn about the latest updates with Docling: https://docling-project.github.io/docling/#features
- ๐ Get started with Docling concepts, integrations and tutorials: https://docling-project.github.io/docling/getting_started/
- ๐ก Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources