zahra-kolagar committed
Commit
50077c6
·
unverified ·
1 Parent(s): 8e1e6fe

Use untied branch as default

Files changed (5)
  1. .gitattributes +1 -0
  2. README.md +331 -0
  3. config.json +65 -0
  4. generation_config.json +8 -0
  5. model.safetensors +3 -0
.gitattributes CHANGED
@@ -33,3 +33,4 @@ saved_model/**/* filter=lfs diff=lfs merge=lfs -text
 *.zip filter=lfs diff=lfs merge=lfs -text
 *.zst filter=lfs diff=lfs merge=lfs -text
 *tfevents* filter=lfs diff=lfs merge=lfs -text
+granite_docling.png filter=lfs diff=lfs merge=lfs -text
README.md ADDED
@@ -0,0 +1,331 @@
---
license: apache-2.0
datasets:
- ds4sd/SynthCodeNet
- ds4sd/SynthFormulaNet
- ds4sd/SynthChartNet
- HuggingFaceM4/DoclingMatix
tags:
- text-generation
- documents
- code
- formula
- chart
- ocr
- layout
- table
- document-parse
- docling
- granite
- extraction
- math
---

# granite-docling-258m

**Model Summary**: Granite Docling is a multimodal Image-Text-to-Text model engineered for efficient document conversion. It preserves the core features of Docling while maintaining seamless integration with [DoclingDocuments](https://docling-project.github.io/docling/) to ensure full compatibility.

Granite Docling 258M builds upon the IDEFICS3 architecture, but introduces two key modifications: it replaces the vision encoder with siglip2-base-patch16-512 and substitutes the language model with a Granite 165M LLM.

- **Developed by**: IBM Research
- **Model type**: Multi-modal model (image+text-to-text)
- **Language(s)**: English (NLP)
- **License**: [Apache 2.0](https://www.apache.org/licenses/LICENSE-2.0)
- **Release Date**: September 17, 2025

Granite-docling-258M is fully integrated into the Docling pipelines, carrying over existing [features](https://huggingface.co/ds4sd/SmolDocling-256M-preview) while introducing a number of powerful new features, including:

- 🔢 Enhanced Equation Recognition: More accurate detection and formatting of mathematical formulas
- 🧩 Flexible Inference Modes: Choose between full-page inference and bbox-guided region inference
- 🧘 Improved Stability: Less prone to getting stuck in infinite generation loops
- 🧮 Enhanced Inline Equations: Better recognition of inline math
- 🧾 Document Element QA: Answer questions about a document's structure, such as the presence and order of document elements
- 🌍 Japanese, Arabic and Chinese support (_experimental_)



## Evaluations

<table>
<thead>
<tr>
<th></th>
<th><b>smoldocling-256m-preview</b></th>
<th><b>granite-docling-258m</b></th>
</tr>
</thead>
<tbody>
<tr><td colspan="3"><b>Layout</b></td></tr>
<tr><td>MAP ↑</td><td>0.21</td><td><b>0.28</b></td></tr>
<tr><td>F1 ↑</td><td>0.79</td><td><b>0.85</b></td></tr>
<tr><td>Precision ↑</td><td>0.86</td><td><b>0.87</b></td></tr>
<tr><td>Recall ↑</td><td>0.82</td><td><b>0.89</b></td></tr>
<tr><td colspan="3"><b>Full Page OCR</b></td></tr>
<tr><td>Edit-distance ↓</td><td>0.48 (0.46)</td><td><b>0.46</b> (<b>0.44</b>)</td></tr>
<tr><td>F1 ↑</td><td><b>0.80</b> (0.76)</td><td>0.75 (<b>0.78</b>)</td></tr>
<tr><td>Precision ↑</td><td><b>0.89</b> (0.85)</td><td>0.81 (0.85)</td></tr>
<tr><td>Recall ↑</td><td><b>0.79</b> (0.74)</td><td>0.73 (<b>0.77</b>)</td></tr>
<tr><td>BLEU ↑</td><td><b>0.58</b> (0.54)</td><td>0.56 (<b>0.59</b>)</td></tr>
<tr><td>Meteor ↑</td><td>0.67 (0.67)</td><td>0.67 (<b>0.70</b>)</td></tr>
<tr><td colspan="3"><b>Code Recognition</b></td></tr>
<tr><td>Edit-distance ↓</td><td>0.114</td><td><b>0.013</b></td></tr>
<tr><td>F1 ↑</td><td>0.915</td><td><b>0.988</b></td></tr>
<tr><td>Precision ↑</td><td>0.94</td><td><b>0.99</b></td></tr>
<tr><td>Recall ↑</td><td>0.909</td><td><b>0.988</b></td></tr>
<tr><td>BLEU ↑</td><td>0.875</td><td><b>0.983</b></td></tr>
<tr><td>Meteor ↑</td><td>0.889</td><td><b>0.986</b></td></tr>
<tr><td colspan="3"><b>Equation Recognition</b></td></tr>
<tr><td>Edit-distance ↓</td><td>0.119</td><td><b>0.073</b></td></tr>
<tr><td>F1 ↑</td><td>0.947</td><td><b>0.968</b></td></tr>
<tr><td>Precision ↑</td><td>0.959</td><td><b>0.968</b></td></tr>
<tr><td>Recall ↑</td><td>0.941</td><td><b>0.969</b></td></tr>
<tr><td>BLEU ↑</td><td>0.824</td><td><b>0.893</b></td></tr>
<tr><td>Meteor ↑</td><td>0.878</td><td><b>0.927</b></td></tr>
<tr><td colspan="3"><b>Table Recognition (FinTabNet 150dpi)</b></td></tr>
<tr><td>TEDS (structure) ↑</td><td>0.82</td><td><b>0.97</b></td></tr>
<tr><td>TEDS (w/content) ↑</td><td>0.76</td><td><b>0.96</b></td></tr>
<tr><td colspan="3"><b>Other Benchmarks</b></td></tr>
<tr><td>MMStar ↑</td><td>0.17</td><td><b>0.3</b></td></tr>
<tr><td>OCRBench ↑</td><td>338</td><td><b>500</b></td></tr>
</tbody>
</table>

## Getting started

You can use **transformers**, **vllm**, or **onnx** to perform inference, and [Docling](https://github.com/docling-project/docling) to convert the results to a variety of output formats (md, html, etc.):

<details>
<summary>📄 Single page image inference using Transformers 🤖</summary>

```python
# Prerequisites:
# pip install torch
# pip install docling_core
# pip install transformers

import torch
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from transformers import AutoProcessor, AutoModelForVision2Seq
from transformers.image_utils import load_image
from pathlib import Path

DEVICE = "cuda" if torch.cuda.is_available() else "cpu"

# Load image
image = load_image("https://upload.wikimedia.org/wikipedia/commons/7/76/GazettedeFrance.jpg")

# Initialize processor and model
processor = AutoProcessor.from_pretrained("ibm-granite/granite-docling-258M")
model = AutoModelForVision2Seq.from_pretrained(
    "ibm-granite/granite-docling-258M",
    torch_dtype=torch.bfloat16,
    _attn_implementation="flash_attention_2" if DEVICE == "cuda" else "eager",
).to(DEVICE)

# Create input messages
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Convert this page to docling."}
        ]
    },
]

# Prepare inputs
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt")
inputs = inputs.to(DEVICE)

# Generate outputs
generated_ids = model.generate(**inputs, max_new_tokens=8192)
prompt_length = inputs.input_ids.shape[1]
trimmed_generated_ids = generated_ids[:, prompt_length:]
doctags = processor.batch_decode(
    trimmed_generated_ids,
    skip_special_tokens=False,
)[0].lstrip()

# Populate document
doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [image])
print(doctags)

# Create a DoclingDocument
doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")

# Export to any format, e.g. HTML:
# Path("Out/").mkdir(parents=True, exist_ok=True)
# output_path_html = Path("Out/") / "example.html"
# doc.save_as_html(output_path_html)

# or Markdown:
print(doc.export_to_markdown())
```
</details>


<details>
<summary>🚀 Fast batch inference using vLLM</summary>

```python
# Prerequisites:
# pip install vllm
# pip install docling_core
# place page images you want to convert into "img/" dir

import time
import os
from vllm import LLM, SamplingParams
from PIL import Image
from docling_core.types.doc import DoclingDocument
from docling_core.types.doc.document import DocTagsDocument
from pathlib import Path

# Configuration
MODEL_PATH = "ibm-granite/granite-docling-258M"
IMAGE_DIR = "img/"  # Place your page images here
OUTPUT_DIR = "out/"
PROMPT_TEXT = "Convert page to docling."

# Ensure output directory exists
os.makedirs(OUTPUT_DIR, exist_ok=True)

# Initialize LLM
llm = LLM(model=MODEL_PATH, limit_mm_per_prompt={"image": 1})

sampling_params = SamplingParams(
    temperature=0.0,
    max_tokens=8192
)

# Load and prepare all images and prompts up front
batched_inputs = []
image_names = []

for img_file in sorted(os.listdir(IMAGE_DIR)):
    if img_file.lower().endswith((".png", ".jpg", ".jpeg")):
        img_path = os.path.join(IMAGE_DIR, img_file)
        with Image.open(img_path) as im:
            image = im.convert("RGB")

        prompt = (
            f"<|start_of_role|>user<|end_of_role|><image>{PROMPT_TEXT}<|end_of_text|>\n"
            f"<|start_of_role|>assistant<|end_of_role|>"
        )
        batched_inputs.append({"prompt": prompt, "multi_modal_data": {"image": image}})
        image_names.append(os.path.splitext(img_file)[0])

# Run batch inference
start_time = time.time()
outputs = llm.generate(batched_inputs, sampling_params=sampling_params)

# Postprocess all results
for img_fn, output, input_data in zip(image_names, outputs, batched_inputs):
    doctags = output.outputs[0].text
    output_path_dt = Path(OUTPUT_DIR) / f"{img_fn}.dt"
    output_path_md = Path(OUTPUT_DIR) / f"{img_fn}.md"

    with open(output_path_dt, "w", encoding="utf-8") as f:
        f.write(doctags)

    # Convert to DoclingDocument and save markdown
    doctags_doc = DocTagsDocument.from_doctags_and_image_pairs([doctags], [input_data["multi_modal_data"]["image"]])
    doc = DoclingDocument.load_from_doctags(doctags_doc, document_name="Document")
    doc.save_as_markdown(output_path_md)

print(f"Total time: {time.time() - start_time:.2f} sec")
```
</details>

💻 Local inference on Apple Silicon with MLX: [see here](https://huggingface.co/ibm-granite/granite-docling-258M-mlx)


## Supported Instructions

<table>
<tr>
<th>Description</th>
<th>Instruction</th>
<th>Short Instruction</th>
</tr>
<tr>
<td><b>Full conversion</b></td>
<td>Convert this page to docling.</td>
<td>-</td>
</tr>
<tr>
<td><b>Chart</b></td>
<td>Convert chart to table.</td>
<td><code>&lt;chart&gt;</code></td>
</tr>
<tr>
<td><b>Formula</b></td>
<td>Convert formula to LaTeX.</td>
<td><code>&lt;formula&gt;</code></td>
</tr>
<tr>
<td><b>Code</b></td>
<td>Convert code to text.</td>
<td><code>&lt;code&gt;</code></td>
</tr>
<tr>
<td><b>Table</b></td>
<td>Convert table to OTSL. (<a href="https://arxiv.org/pdf/2305.03393">Lysak et al., 2023</a>)</td>
<td><code>&lt;otsl&gt;</code></td>
</tr>
<tr>
<td rowspan="4"><b>Actions and Pipelines</b></td>
<td>OCR the text in a specific location: &lt;loc_155&gt;&lt;loc_233&gt;&lt;loc_206&gt;&lt;loc_237&gt;</td>
<td>-</td>
</tr>
<tr>
<td>Identify element at: &lt;loc_247&gt;&lt;loc_482&gt;&lt;loc_252&gt;&lt;loc_486&gt;</td>
<td>-</td>
</tr>
<tr>
<td>Find all 'text' elements on the page, retrieve all section headers.</td>
<td>-</td>
</tr>
<tr>
<td>Detect footer elements on the page.</td>
<td>-</td>
</tr>
</table>
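
These instructions drop into the same chat template used in the Transformers example above. Below is a minimal sketch (not an official snippet) for the region-OCR action, reusing `processor`, `model`, `image`, and `DEVICE` from that example; the location tags are the illustrative values from the table and should be replaced with coordinates describing the region you actually want to read:

```python
# Minimal sketch: reuse processor, model, image and DEVICE from the
# Transformers example above, swapping in a region instruction from the table.
region_prompt = "OCR the text in a specific location: <loc_155><loc_233><loc_206><loc_237>"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": region_prompt},
        ],
    },
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=prompt, images=[image], return_tensors="pt").to(DEVICE)
generated_ids = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=False,
)[0])
```

The short instructions (`<chart>`, `<formula>`, `<code>`, `<otsl>`) listed above can be substituted for the prompt text in the same way.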



## Model Architecture

The architecture of granite-docling-258m consists of the following components:

(1) Vision encoder: [siglip2-base-patch16-512](https://huggingface.co/google/siglip2-base-patch16-512).

(2) Vision-language connector: a pixel-shuffle projector (as in Idefics3); see the sketch after this section.

(3) Large language model: Granite 165M.

We built upon [Idefics3](https://huggingface.co/docs/transformers/en/model_doc/idefics3) to train our model. We incorporated DocTags into our LLM's supervised fine-tuning (SFT) data to help the model become familiar with the format, enabling faster convergence and mitigating issues previously observed with SmolDocling.
The model was trained using the [nanoVLM](https://github.com/huggingface/nanoVLM) framework, which provides a lightweight and efficient training setup for vision-language models.
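
For intuition about the connector: pixel shuffle trades spatial resolution for channel depth before the visual tokens enter the language model. With the `scale_factor` of 4 from the `config.json` below, every 4×4 group of patch tokens is folded into one token whose feature dimension grows 16×, so a 512×512 image (32×32 patches from the SigLIP2 encoder) yields 64 visual tokens instead of 1024. The following is a minimal sketch of that reshaping, not the exact Idefics3 connector code:

```python
import torch

def pixel_shuffle(x: torch.Tensor, scale_factor: int = 4) -> torch.Tensor:
    """Sketch of the space-to-depth idea behind an Idefics3-style connector:
    the number of visual tokens shrinks by scale_factor**2 while the feature
    dimension grows by the same amount."""
    bsz, seq, dim = x.shape
    side = int(seq ** 0.5)  # assumes a square grid of patch tokens
    x = x.view(bsz, side, side, dim)
    # fold groups of `scale_factor` columns into the channel dimension
    x = x.view(bsz, side, side // scale_factor, dim * scale_factor)
    x = x.permute(0, 2, 1, 3)
    # then fold groups of `scale_factor` rows as well
    x = x.reshape(bsz, side // scale_factor, side // scale_factor, dim * scale_factor**2)
    x = x.permute(0, 2, 1, 3)
    return x.reshape(bsz, seq // scale_factor**2, dim * scale_factor**2)

# 512x512 input, 16x16 patches, hidden size 768 (see vision_config below)
tokens = torch.randn(1, 32 * 32, 768)
print(pixel_shuffle(tokens).shape)  # torch.Size([1, 64, 12288])
```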


## Training Data

Our training corpus consists of two principal sources: (1) publicly available datasets and (2) internally constructed synthetic datasets designed to elicit specific document understanding capabilities.

In particular, we incorporate:

* [**SynthCodeNet**](https://huggingface.co/datasets/ds4sd/SynthCodeNet) — a large-scale collection of synthetically rendered code snippets spanning over 50 programming languages
* [**SynthFormulaNet**](https://huggingface.co/datasets/ds4sd/SynthFormulaNet) — a dataset of synthetic mathematical expressions paired with ground-truth LaTeX representations
* [**SynthChartNet**](https://huggingface.co/datasets/ds4sd/SynthChartNet) — synthetic chart images annotated with structured table outputs
* [**DoclingMatix**](https://huggingface.co/datasets/HuggingFaceM4/DoclingMatix) — a curated corpus of real-world document pages sampled from diverse domains

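The public datasets above are hosted on the Hugging Face Hub and can be inspected directly with the `datasets` library. A minimal sketch follows; the split name and record fields are assumptions, so check each dataset card:

```python
from datasets import load_dataset

# Stream a few records from one of the public training datasets listed above.
# "train" is an assumed split name; adjust it if the dataset card says otherwise.
ds = load_dataset("ds4sd/SynthChartNet", split="train", streaming=True)

for i, sample in enumerate(ds):
    # Print the field names rather than assuming a particular schema.
    print(sorted(sample.keys()))
    if i >= 2:
        break
```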


## Infrastructure

We train granite-docling-258m using IBM's supercomputing cluster, Blue Vela, which is outfitted with NVIDIA H100 GPUs. This cluster provides a scalable and efficient infrastructure for training our models over thousands of GPUs.

## Resources

- ⭐️ Learn about the latest updates with Docling: https://docling-project.github.io/docling/#features
- 🚀 Get started with Docling concepts, integrations and tutorials: https://docling-project.github.io/docling/getting_started/
- 💡 Learn about the latest Granite learning resources: https://ibm.biz/granite-learning-resources
config.json ADDED
@@ -0,0 +1,65 @@
{
  "architectures": [
    "Idefics3ForConditionalGeneration"
  ],
  "bos_token_id": 100264,
  "eos_token_id": 100257,
  "image_token_id": 100270,
  "model_type": "idefics3",
  "pad_token_id": 100257,
  "scale_factor": 4,
  "text_config": {
    "_name_or_path": "/models/granitev06_hf_ai4k_sft_data_v4",
    "architectures": [
      "LlamaForCausalLM"
    ],
    "attention_bias": false,
    "attention_dropout": 0.0,
    "bos_token_id": 100264,
    "eos_token_id": 100257,
    "head_dim": 64,
    "hidden_act": "silu",
    "hidden_size": 576,
    "initializer_range": 0.02,
    "intermediate_size": 1536,
    "max_position_embeddings": 8192,
    "mlp_bias": false,
    "model_type": "llama",
    "num_attention_heads": 9,
    "num_hidden_layers": 30,
    "num_key_value_heads": 3,
    "pad_token_id": 100257,
    "pretraining_tp": 1,
    "rms_norm_eps": 1e-05,
    "rope_scaling": null,
    "rope_theta": 100000.0,
    "torch_dtype": "bfloat16",
    "use_cache": false,
    "vocab_size": 100352
  },
  "tie_word_embeddings": false,
  "torch_dtype": "bfloat16",
  "transformers_version": "4.55.2",
  "use_cache": true,
  "vision_config": {
    "attention_dropout": 0.0,
    "hidden_act": "gelu_pytorch_tanh",
    "hidden_size": 768,
    "image_size": 512,
    "initializer_range": 0.02,
    "intermediate_size": 3072,
    "layer_norm_eps": 1e-06,
    "max_image_size": {
      "longest_edge": 512
    },
    "model_type": "idefics3_vision",
    "num_attention_heads": 12,
    "num_channels": 3,
    "num_hidden_layers": 12,
    "patch_size": 16,
    "size": {
      "longest_edge": 512
    }
  },
  "vocab_size": 100352
}
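
The fields above are what `transformers` exposes once the config is loaded from the Hub; note that `"tie_word_embeddings": false` is the "untied" setting referenced in the commit title. A minimal sketch for inspecting it:

```python
from transformers import AutoConfig

# Load the Idefics3 config shown above directly from the Hub.
config = AutoConfig.from_pretrained("ibm-granite/granite-docling-258M")

print(config.model_type)                # "idefics3"
print(config.scale_factor)              # 4 -> pixel-shuffle factor used by the connector
print(config.tie_word_embeddings)       # False: untied input/output embeddings
print(config.text_config.hidden_size)   # 576 (Granite 165M language tower)
print(config.vision_config.image_size)  # 512 (SigLIP2 base, patch size 16)
```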
generation_config.json ADDED
@@ -0,0 +1,8 @@
{
  "_from_model_config": true,
  "bos_token_id": 100264,
  "eos_token_id": 100257,
  "pad_token_id": 100257,
  "transformers_version": "4.55.2",
  "use_cache": false
}
model.safetensors ADDED
@@ -0,0 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:824a4c81f4b62308c26cb54bd4ee70c8ed8890874c7cfe7db9ba1176af023d97
size 746304208