zahra-kolagar committed
Commit
50077c6
·
unverified ·
1 Parent(s): 8e1e6fe

Use untied branch as default

Files changed (5)
  1. .gitattributes +1 -0
  2. README.md +331 -0
  3. config.json +65 -0
  4. generation_config.json +8 -0
  5. model.safetensors +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+granite_docling.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,331 @@
---
license: apache-2.0
datasets:
- ds4sd/SynthCodeNet
- ds4sd/SynthFormulaNet
- ds4sd/SynthChartNet
- HuggingFaceM4/DoclingMatix
tags:
- text-generation
- documents
- code
- formula
- chart
- ocr
- layout
- table
- document-parse
- docling
- granite
- extraction
- math
---

# granite-docling-258m

**Model Summary**: Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with [DoclingDocuments](https://docling-project.github.io/docling/) to ensure full compatibility.

Granite Docling 258M builds upon the IDEFICS3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM.

- **Developed by**: IBM Research
- **Model type**: Multi-modal model (image+text-to-text)
- **Language(s)**: English (NLP)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Release Date**: September 17, 2025

Granite-docling-258M is fully integrated into the Docling pipelines, carrying over existing [features](https://huggingface.co/ds4sd/SmolDocling-256M-preview) while introducing a number of powerful new features, including:

- 🔢 Enhanced Equation Recognition: More accurate detection and formatting of mathematical formulas
- 🧩 Flexible Inference Modes: Choose between full-page inference and bbox-guided region inference
- 🧘 Improved Stability: Less prone to getting stuck in infinite generation loops
- 🧮 Enhanced Inline Equations: Better recognition of inline math
- 🧾 Document Element QA: Answer questions about a document's structure, such as the presence and order of document elements
- 🌍 Japanese, Arabic and Chinese support (_experimental_)



## Evaluations

<table>
<thead>
<tr>
<th></th>
<th><b>smoldocling-256m-preview</b></th>
<th><b>granite-docling-258m</b></th>
</tr>
</thead>
<tbody>
<tr><td colspan="3"><b>Layout</b></td></tr>
<tr><td>MAP ↑</td><td>0.21</td><td><b>0.28</b></td></tr>
<tr><td>F1 ↑</td><td>0.79</td><td><b>0.85</b></td></tr>
<tr><td>Precision ↑</td><td>0.86</td><td><b>0.87</b></td></tr>
<tr><td>Recall ↑</td><td>0.82</td><td><b>0.89</b></td></tr>
<tr><td colspan="3"><b>Full Page OCR</b></td></tr>
<tr><td>Edit-distance ↓</td><td>0.48 (0.46)</td><td><b>0.46</b> (<b>0.44</b>)</td></tr>
<tr><td>F1 ↑</td><td><b>0.80</b> (0.76)</td><td>0.75 (<b>0.78</b>)</td></tr>
<tr><td>Precision ↑</td><td><b>0.89</b> (0.85)</td><td>0.81 (0.85)</td></tr>
<tr><td>Recall ↑</td><td><b>0.79</b> (0.74)</td><td>0.73 (<b>0.77</b>)</td></tr>
<tr><td>BLEU ↑</td><td><b>0.58</b> (0.54)</td><td>0.56 (<b>0.59</b>)</td></tr>
<tr><td>Meteor ↑</td><td>0.67 (0.67)</td><td>0.67 (<b>0.70</b>)</td></tr>
<tr><td colspan="3"><b>Code Recognition</b></td></tr>
<tr><td>Edit-distance ↓</td><td>0.114</td><td><b>0.013</b></td></tr>
<tr><td>F1 ↑</td><td>0.915</td><td><b>0.988</b></td></tr>
<tr><td>Precision ↑</td><td>0.94</td><td><b>0.99</b></td></tr>
<tr><td>Recall ↑</td><td>0.909</td><td><b>0.988</b></td></tr>
<tr><td>BLEU ↑</td><td>0.875</td><td><b>0.983</b></td></tr>
<tr><td>Meteor ↑</td><td>0.889</td><td><b>0.986</b></td></tr>
<tr><td colspan="3"><b>Equation Recognition</b></td></tr>
<tr><td>Edit-distance ↓</td><td>0.119</td><td><b>0.073</b></td></tr>
<tr><td>F1 ↑</td><td>0.947</td><td><b>0.968</b></td></tr>
<tr><td>Precision ↑</td><td>0.959</td><td><b>0.968</b></td></tr>
<tr><td>Recall ↑</td><td>0.941</td><td><b>0.969</b></td></tr>
<tr><td>BLEU ↑</td><td>0.824</td><td><b>0.893</b></td></tr>
<tr><td>Meteor ↑</td><td>0.878</td><td><b>0.927</b></td></tr>
<tr><td colspan="3"><b>Table Recognition (FinTabNet 150dpi)</b></td></tr>
<tr><td>TEDS (structure) ↑</td><td>0.82</td><td><b>0.97</b></td></tr>
<tr><td>TEDS (w/content) ↑</td><td>0.76</td><td><b>0.96</b></td></tr>
<tr><td colspan="3"><b>Other Benchmarks</b></td></tr>
<tr><td>MMStar ↑</td><td>0.17</td><td><b>0.3</b></td></tr>
<tr><td>OCRBench ↑</td><td>338</td><td><b>500</b></td></tr>
</tbody>
</table>

## Getting started

You can use **transformers**, **vllm**, or **onnx** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (md, html, etc.):

<details>
<summary>📄 Single page image inference using Transformers 🤖</summary>

```python
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load image
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)

# Create a DoclingDocument
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")

# Export to any format, e.g. HTML:
# Path("Out/").mkdir(parents=True, exist_ok=True)
# output_path_html = Path("Out/") / "example.html"
# doc.save_as_html(output_path_html)

# or Markdown:
print(doc.export_to_markdown())
```
</details>


<details>
<summary>🚀 Fast batch inference using vLLM</summary>

```python
# Prerequisites:
# pip install vllm
# pip install docling_core
# place page images you want to convert into "img/" dir

import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from pathlib import Path

# Configuration
MODEL_PATH = "ibm-granite/granite-docling-258M"
IMAGE_DIR = "img/"  # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192
)

# Load and prepare all images and prompts up front
batched_inputs = []
image_names = []

for img_file in sorted(os.listdir(IMAGE_DIR)):
    if img_file.lower().endswith((".png", ".jpg", ".jpeg")):
        img_path = os.path.join(IMAGE_DIR, img_file)
        with Image.open(img_path) as im:
            image = im.convert("RGB")

        prompt = (
            f"<|start_of_role|>user<|end_of_role|><image>{PROMPT_TEXT}<|end_of_text|>\n"
            f"<|start_of_role|>assistant<|end_of_role|>"
        )
        batched_inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
        image_names.append(os.path.splitext(img_file)[0])

# Run batch inference
start_time = time.time()
outputs = llm.generate(batched_inputs, sampling_params=sampling_params)

# Postprocess all results
for img_fn, output, input_data in zip(image_names, outputs, batched_inputs):
    doctags = output.outputs[0].text
    output_path_dt = Path(OUTPUT_DIR) / f"{img_fn}.dt"
    output_path_md = Path(OUTPUT_DIR) / f"{img_fn}.md"

    with open(output_path_dt, "w", encoding="utf-8") as f:
        f.write(doctags)

    # Convert to DoclingDocument and save markdown
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [input_data["multi_modal_data"]["image"]])
    doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
    doc.save_as_markdown(output_path_md)

print(f"Total time: {time.time() - start_time:.2f} sec")
```
</details>

💻 Local inference on Apple Silicon with MLX: [see here](https://huggingface.co/ibm-granite/granite-docling-258M-mlx)


## Supported Instructions

<table>
<tr>
<th>Description</th>
<th>Instruction</th>
<th>Short Instruction</th>
</tr>
<tr>
<td><b>Full conversion</b></td>
<td>Convert this page to docling.</td>
<td>-</td>
</tr>
<tr>
<td><b>Chart</b></td>
<td>Convert chart to table.</td>
<td><code>&lt;chart&gt;</code></td>
</tr>
<tr>
<td><b>Formula</b></td>
<td>Convert formula to LaTeX.</td>
<td><code>&lt;formula&gt;</code></td>
</tr>
<tr>
<td><b>Code</b></td>
<td>Convert code to text.</td>
<td><code>&lt;code&gt;</code></td>
</tr>
<tr>
<td><b>Table</b></td>
<td>Convert table to OTSL. (<a href="https://arxiv.org/pdf/2305.03393">Lysak et al., 2023</a>)</td>
<td><code>&lt;otsl&gt;</code></td>
</tr>
<tr>
<td rowspan="4"><b>Actions and Pipelines</b></td>
<td>OCR the text in a specific location: &lt;loc_155&gt;&lt;loc_233&gt;&lt;loc_206&gt;&lt;loc_237&gt;</td>
<td>-</td>
</tr>
<tr>
<td>Identify element at: &lt;loc_247&gt;&lt;loc_482&gt;&lt;loc_252&gt;&lt;loc_486&gt;</td>
<td>-</td>
</tr>
<tr>
<td>Find all 'text' elements on the page, retrieve all section headers.</td>
<td>-</td>
</tr>
<tr>
<td>Detect footer elements on the page.</td>
<td>-</td>
</tr>
</table>
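
These instructions drop into the same chat template used in the Transformers example above. Below is a minimal sketch (not an official snippet) for the region-OCR action, reusing `processor`, `model`, `image`, and `DEVICE` from that example; the location tags are the illustrative values from the table and should be replaced with coordinates describing the region you actually want to read:

```python
# Minimal sketch: reuse processor, model, image and DEVICE from the
# Transformers example above, swapping in a region instruction from the table.
region_prompt = "OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": region_prompt},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)
generated_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=False,
)[0])
```

The short instructions (`<chart>`, `<formula>`, `<code>`, `<otsl>`) listed above can be substituted for the prompt text in the same way.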



## Model Architecture

The architecture of granite-docling-258m consists of the following components:

(1) Vision encoder: [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512).

(2) Vision-language connector: a pixel-shuffle projector (as in Idefics3); see the sketch after this section.

(3) Large language model: Granite 165M.

We built upon [Idefics3](https://huggingface.co/docs/transformers/en/model_doc/idefics3) to train our model. We incorporated DocTags into our LLM's supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling.
The model was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework, which provides a lightweight and efficient training setup for vision-language models.
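
For intuition about the connector: pixel shuffle trades spatial resolution for channel depth before the visual tokens enter the language model. With the `scale_factor` of 4 from the `config.json` below, every 4×4 group of patch tokens is folded into one token whose feature dimension grows 16×, so a 512×512 image (32×32 patches from the SigLIP2 encoder) yields 64 visual tokens instead of 1024. The following is a minimal sketch of that reshaping, not the exact Idefics3 connector code:

```python
import torch

def pixel_shuffle(x: torch.Tensor, scale_factor: int = 4) -> torch.Tensor:
    """Sketch of the space-to-depth idea behind an Idefics3-style connector:
    the number of visual tokens shrinks by scale_factor**2 while the feature
    dimension grows by the same amount."""
    bsz, seq, dim = x.shape
    side = int(seq ** 0.5)  # assumes a square grid of patch tokens
    x = x.view(bsz, side, side, dim)
    # fold groups of `scale_factor` columns into the channel dimension
    x = x.view(bsz, side, side // scale_factor, dim * scale_factor)
    x = x.permute(0, 2, 1, 3)
    # then fold groups of `scale_factor` rows as well
    x = x.reshape(bsz, side // scale_factor, side // scale_factor, dim * scale_factor**2)
    x = x.permute(0, 2, 1, 3)
    return x.reshape(bsz, seq // scale_factor**2, dim * scale_factor**2)

# 512x512 input, 16x16 patches, hidden size 768 (see vision_config below)
tokens = torch.randn(1, 32 * 32, 768)
print(pixel_shuffle(tokens).shape)  # torch.Size([1, 64, 12288])
```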


## Training Data

Our training corpus consists of two principal sources: (1) publicly available datasets and (2) internally constructed synthetic datasets designed to elicit specific document understanding capabilities.

In particular, we incorporate:

* [**SynthCodeNet**](https://huggingface.co/datasets/ds4sd/SynthCodeNet) — a large-scale collection of synthetically rendered code snippets spanning over 50 programming languages
* [**SynthFormulaNet**](https://huggingface.co/datasets/ds4sd/SynthFormulaNet) — a dataset of synthetic mathematical expressions paired with ground-truth LaTeX representations
* [**SynthChartNet**](https://huggingface.co/datasets/ds4sd/SynthChartNet) — synthetic chart images annotated with structured table outputs
* [**DoclingMatix**](https://huggingface.co/datasets/HuggingFaceM4/DoclingMatix) — a curated corpus of real-world document pages sampled from diverse domains

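The public datasets above are hosted on the Hugging Face Hub and can be inspected directly with the `datasets` library. A minimal sketch follows; the split name and record fields are assumptions, so check each dataset card:

```python
from datasets import load_dataset

# Stream a few records from one of the public training datasets listed above.
# "train" is an assumed split name; adjust it if the dataset card says otherwise.
ds = load_dataset("ds4sd/SynthChartNet", split="train", streaming=True)

for i, sample in enumerate(ds):
    # Print the field names rather than assuming a particular schema.
    print(sorted(sample.keys()))
    if i >= 2:
        break
```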


## Infrastructure

We train granite-docling-258m using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

## Resources

- ⭐️ Learn about the latest updates with Docling: https://docling-project.github.io/docling/#features
- 🚀 Get started with Docling concepts, integrations and tutorials: https://docling-project.github.io/docling/getting_started/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
config.json ADDED
@@ -0,0 +1,65 @@
{
  "architectures": [
    "Idefics3ForConditionalGeneration"
  ],
  "bos_token_id": 100264,
  "eos_token_id": 100257,
  "image_token_id": 100270,
  "model_type": "idefics3",
  "pad_token_id": 100257,
  "scale_factor": 4,
  "text_config": {
    "_name_or_path": "/models/granitev06_hf_ai4k_sft_data_v4",
    "architectures": [
      "LlamaForCausalLM"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
    "bos_token_id": 100264,
    "eos_token_id": 100257,
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 576,
    "initializer_range": 0.02,
    "intermediate_size": 1536,
    "max_position_embeddings": 8192,
    "mlp_bias": false,
    "model_type": "llama",
    "num_attention_heads": 9,
    "num_hidden_layers": 30,
    "num_key_value_heads": 3,
    "pad_token_id": 100257,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "rope_theta": 100000.0,
    "torch_dtype": "bfloat16",
    "use_cache": false,
    "vocab_size": 100352
  },
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.55.2",
  "use_cache": true,
  "vision_config": {
    "attention_dropout": 0.0,
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 768,
    "image_size": 512,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-06,
    "max_image_size": {
      "longest_edge": 512
    },
    "model_type": "idefics3_vision",
    "num_attention_heads": 12,
    "num_channels": 3,
    "num_hidden_layers": 12,
    "patch_size": 16,
    "size": {
      "longest_edge": 512
    }
  },
  "vocab_size": 100352
}
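
The fields above are what `transformers` exposes once the config is loaded from the Hub; note that `"tie_word_embeddings": false` is the "untied" setting referenced in the commit title. A minimal sketch for inspecting it:

```python
from transformers import AutoConfig

# Load the Idefics3 config shown above directly from the Hub.
config = AutoConfig.from_pretrained("ibm-granite/granite-docling-258M")

print(config.model_type)                # "idefics3"
print(config.scale_factor)              # 4 -> pixel-shuffle factor used by the connector
print(config.tie_word_embeddings)       # False: untied input/output embeddings
print(config.text_config.hidden_size)   # 576 (Granite 165M language tower)
print(config.vision_config.image_size)  # 512 (SigLIP2 base, patch size 16)
```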
generation_config.json ADDED
@@ -0,0 +1,8 @@
{
  "_from_model_config": true,
  "bos_token_id": 100264,
  "eos_token_id": 100257,
  "pad_token_id": 100257,
  "transformers_version": "4.55.2",
  "use_cache": false
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:824a4c81f4b62308c26cb54bd4ee70c8ed8890874c7cfe7db9ba1176af023d97
size 746304208