---
license: apache-2.0
datasets:
- ds4sd/SynthCodeNet
- ds4sd/SynthFormulaNet
- ds4sd/SynthChartNet
- HuggingFaceM4/DoclingMatix
tags:
- text-generation
- documents
- code
- formula
- chart
- ocr
- layout
- table
- document-parse
- docling
- granite
- extraction
- math
---

# granite-docling-258m

**Model Summary**: Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with [DoclingDocuments](https://docling-project.github.io/docling/) to ensure full compatibility. Granite Docling 258M builds upon the IDEFICS3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM.

- **Developed by**: IBM Research
- **Model type**: Multi-modal model (image+text-to-text)
- **Language(s)**: English (NLP)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Release Date**: September 17, 2025

Granite-docling-258M is fully integrated into the Docling pipelines, carrying over existing [features](https://huggingface.co/ds4sd/SmolDocling-256M-preview) while introducing a number of powerful new features, including:

- 🔢 **Enhanced Equation Recognition**: More accurate detection and formatting of mathematical formulas
- 🧩 **Flexible Inference Modes**: Choose between full-page inference and bbox-guided region inference
- 🧘 **Improved Stability**: More robust against getting stuck in infinite generation loops
- 🧮 **Enhanced Inline Equations**: Better inline math recognition
- 🧾 **Document Element QA**: Answer questions about a document's structure, such as the presence and order of document elements
- 🌍 **Japanese, Arabic and Chinese support** (_experimental_)

## Evaluations
| | smoldocling-256m-preview | granite-docling-258m |
|---|---|---|
| **Layout** | | |
| MAP ↑ | 0.21 | 0.28 |
| F1 ↑ | 0.79 | 0.85 |
| Precision ↑ | 0.86 | 0.87 |
| Recall ↑ | 0.82 | 0.89 |
| **Full Page OCR** | | |
| Edit-distance ↓ | 0.48 (0.46) | 0.46 (0.44) |
| F1 ↑ | 0.80 (0.76) | 0.75 (0.78) |
| Precision ↑ | 0.89 (0.85) | 0.81 (0.85) |
| Recall ↑ | 0.79 (0.74) | 0.73 (0.77) |
| BLEU ↑ | 0.58 (0.54) | 0.56 (0.59) |
| Meteor ↑ | 0.67 (0.67) | 0.67 (0.70) |
| **Code Recognition** | | |
| Edit-distance ↓ | 0.114 | 0.013 |
| F1 ↑ | 0.915 | 0.988 |
| Precision ↑ | 0.94 | 0.99 |
| Recall ↑ | 0.909 | 0.988 |
| BLEU ↑ | 0.875 | 0.983 |
| Meteor ↑ | 0.889 | 0.986 |
| **Equation Recognition** | | |
| Edit-distance ↓ | 0.119 | 0.073 |
| F1 ↑ | 0.947 | 0.968 |
| Precision ↑ | 0.959 | 0.968 |
| Recall ↑ | 0.941 | 0.969 |
| BLEU ↑ | 0.824 | 0.893 |
| Meteor ↑ | 0.878 | 0.927 |
| **Table Recognition (FinTabNet 150dpi)** | | |
| TEDS (structure) ↑ | 0.82 | 0.97 |
| TEDS (w/content) ↑ | 0.76 | 0.96 |
| **Other Benchmarks** | | |
| MMStar ↑ | 0.17 | 0.30 |
| OCRBench ↑ | 338 | 500 |
## Getting started

You can use **transformers**, **vllm**, or **onnx** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (md, html, etc.):
**📄 Single page image inference using Transformers 🤖**

```python
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load images
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)

# Create a DoclingDocument
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")

# Export as any format
# HTML
# Path("Out/").mkdir(parents=True, exist_ok=True)
# output_path_html = Path("Out/") / "example.html"
# doc.save_as_html(output_path_html)

# MD
print(doc.export_to_markdown())
```
**🚀 Fast Batch Inference Using VLLM**

```python
# Prerequisites:
# pip install vllm
# pip install docling_core
# place page images you want to convert into "img/" dir

import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from pathlib import Path

# Configuration
MODEL_PATH = "ibm-granite/granite-docling-258M"
IMAGE_DIR = "img/"  # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192
)

# Load and prepare all images and prompts up front
batched_inputs = []
image_names = []
for img_file in sorted(os.listdir(IMAGE_DIR)):
    if img_file.lower().endswith((".png", ".jpg", ".jpeg")):
        img_path = os.path.join(IMAGE_DIR, img_file)
        with Image.open(img_path) as im:
            image = im.convert("RGB")
        prompt = (
            f"<|start_of_role|>user<|end_of_role|>{PROMPT_TEXT}<|end_of_text|>\n"
            f"<|start_of_role|>assistant<|end_of_role|>"
        )
        batched_inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
        image_names.append(os.path.splitext(img_file)[0])

# Run batch inference
start_time = time.time()
outputs = llm.generate(batched_inputs, sampling_params=sampling_params)

# Postprocess all results
for img_fn, output, input_data in zip(image_names, outputs, batched_inputs):
    doctags = output.outputs[0].text
    output_path_dt = Path(OUTPUT_DIR) / f"{img_fn}.dt"
    output_path_md = Path(OUTPUT_DIR) / f"{img_fn}.md"

    # Save DocTags output
    with open(output_path_dt, "w", encoding="utf-8") as f:
        f.write(doctags)

    # Convert to DoclingDocument and save markdown
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [input_data["multi_modal_data"]["image"]])
    doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
    doc.save_as_markdown(output_path_md)

print(f"Total time: {time.time() - start_time:.2f} sec")
```
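You can also let Docling drive granite-docling end to end and handle the export for you. The sketch below assumes a recent Docling release that ships the VLM pipeline (`VlmPipeline`, `VlmPipelineOptions`); class names and defaults may differ between Docling versions, so treat it as a starting point rather than the canonical integration:

```python
# A minimal sketch of running granite-docling through Docling's VLM pipeline.
# Assumes a Docling version that provides VlmPipeline/VlmPipelineOptions;
# exact class names and defaults may vary between releases.
from docling.datamodel.base_models import InputFormat
from docling.datamodel.pipeline_options import VlmPipelineOptions
from docling.document_converter import DocumentConverter, PdfFormatOption
from docling.pipeline.vlm_pipeline import VlmPipeline

# Default VLM pipeline options; point them at granite-docling explicitly
# if your Docling version does not already use it by default.
pipeline_options = VlmPipelineOptions()

converter = DocumentConverter(
    format_options={
        InputFormat.PDF: PdfFormatOption(
            pipeline_cls=VlmPipeline,
            pipeline_options=pipeline_options,
        )
    }
)

# Convert a PDF and export the resulting DoclingDocument as Markdown.
result = converter.convert("document.pdf")  # hypothetical input path
print(result.document.export_to_markdown())
```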
**💻 Local inference on Apple Silicon with MLX:** [see here](https://huggingface.co/ibm-granite/granite-docling-258M-mlx)

## Supported Instructions
| Description | Instruction | Short Instruction |
|---|---|---|
| Full conversion | Convert this page to docling. | - |
| Chart | Convert chart to table. | `<chart>` |
| Formula | Convert formula to LaTeX. | `<formula>` |
| Code | Convert code to text. | `<code>` |
| Table | Convert table to OTSL. (Lysak et al., 2023) | `<otsl>` |
| Actions and Pipelines | OCR the text in a specific location: `<loc_155><loc_233><loc_206><loc_237>` | - |
| | Identify element at: `<loc_247><loc_482><loc_252><loc_486>` | - |
| | Find all 'text' elements on the page, retrieve all section headers. | - |
| | Detect footer elements on the page. | - |
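As an example of a targeted instruction, the sketch below reuses the Transformers setup from the Getting started section but swaps the prompt for the chart instruction from the table above. The image URL is only a placeholder; in practice you would pass a page (or cropped region) that actually contains a chart:

```python
# Minimal sketch: same Transformers workflow as above, with a targeted
# instruction ("Convert chart to table.") instead of full-page conversion.
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M", torch_dtype=torch.bfloat16
).to(DEVICE)

# Placeholder image; use a page or crop that contains the element you target.
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert chart to table."},
        ],
    },
]
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)

generated_ids = model.generate(**inputs, max_new_tokens=4096)
doctags = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=False
)[0].lstrip()
print(doctags)
```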
## Model Architecture

The architecture of granite-docling-258m consists of the following components:

1. Vision encoder: [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512)
2. Vision-language connector: pixel shuffle projector (as in Idefics3)
3. Large language model: Granite 165M

We built upon [Idefics3](https://huggingface.co/docs/transformers/en/model_doc/idefics3) to train our model. We incorporated DocTags into our LLM's supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling. The model was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework, which provides a lightweight and efficient training setup for vision-language models.

## Training Data

Our training corpus consists of two principal sources: (1) publicly available datasets and (2) internally constructed synthetic datasets designed to elicit specific document understanding capabilities. In particular, we incorporate:

* [**SynthCodeNet**](https://huggingface.co/datasets/ds4sd/SynthCodeNet): a large-scale collection of synthetically rendered code snippets spanning over 50 programming languages
* [**SynthFormulaNet**](https://huggingface.co/datasets/ds4sd/SynthFormulaNet): a dataset of synthetic mathematical expressions paired with ground-truth LaTeX representations
* [**SynthChartNet**](https://huggingface.co/datasets/ds4sd/SynthChartNet): synthetic chart images annotated with structured table outputs
* [**DoclingMatix**](https://huggingface.co/datasets/HuggingFaceM4/DoclingMatix): a curated corpus of real-world document pages sampled from diverse domains

## Infrastructure

We trained granite-docling-258m on Blue Vela, IBM's supercomputing cluster outfitted with NVIDIA H100 GPUs, which provides a scalable and efficient infrastructure for training our models across thousands of GPUs.

## Resources

- ⭐️ Learn about the latest updates with Docling: https://docling-project.github.io/docling/#features
- 🚀 Get started with Docling concepts, integrations and tutorials: https://docling-project.github.io/docling/getting_started/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
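As a quick, optional sanity check on the components listed under Model Architecture, you can inspect the checkpoint's configuration with Transformers. This is only an illustrative sketch; the field names assume the Idefics3-style config layout the model builds upon:

```python
# Illustrative sketch: inspect the multimodal config to see the vision encoder
# and language model settings described in the Model Architecture section.
from transformers import AutoConfig

config = AutoConfig.from_pretrained("ibm-granite/granite-docling-258M")

print(type(config).__name__)   # Idefics3-style multimodal configuration
print(config.vision_config)    # vision encoder settings (SigLIP2-base, 512 px input)
print(config.text_config)      # language model settings (Granite 165M)
```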