katerynaCh committed on
Commit c1b6e1f · verified · 1 Parent(s): 08dfa12

Upload folder using huggingface_hub

Files changed (7)
  1. README.md +178 -0
  2. bias.md +6 -0
  3. explainability.md +15 -0
  4. latex2html.py +448 -0
  5. postprocessing.py +95 -0
  6. privacy.md +10 -0
  7. safety.md +8 -0
README.md ADDED
@@ -0,0 +1,178 @@
+ # nemotron-parse Overview
+
+ nemotron-parse is a general-purpose text-extraction model designed specifically for documents. Given an image, nemotron-parse extracts formatted text together with bounding boxes and the corresponding semantic classes. This benefits several downstream tasks, such as increasing the availability of training data for Large Language Models (LLMs), improving the accuracy of retriever systems, and enhancing document-understanding pipelines.
+
+ This model is ready for commercial use.
+
+ ## License
+ GOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). Use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). Use of the tokenizer included in this model is governed by the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/).
+
+ ## Deployment Geography:
+ Global
+
+ ## Use Case:
+ nemotron-parse provides comprehensive text and document-structure understanding. It is intended for use in retriever and curator solutions; its text-extraction datasets and capabilities support LLM and VLM training and improve the run-time inference accuracy of VLMs.
+ nemotron-parse performs text extraction from PDF and PPT documents. It can classify the objects in a given document (title, section, caption, index, footnote, list, table, bibliography, image) and provide bounding boxes with coordinates.
+
+ ## Release Date:
+ November 17, 2025
+
+ ## References
+ * https://huggingface.co/docs/transformers/en/model_doc/mbart
+
+ ## Model Architecture
+
+ ### Architecture Type:
+ Transformer-based vision-encoder-decoder model
+
+ ### Network Architecture
+ * Vision Encoder: ViT-H model (https://huggingface.co/nvidia/C-RADIO)<br>
+ * Adapter Layer: 1D convolutions & norms to compress the dimensionality and sequence length of the latent space (13184 tokens to 3201 tokens)<br>
+ * Decoder: mBart [1], 10 blocks<br>
+ * Tokenizer: Use of the tokenizer included in this model is governed by the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/)<br>
+ * Number of Parameters: < 1B<br>
+
+ ## Computational Load
+ **Cumulative Compute:** 2.2e+22 <br>
+ **Estimated Energy and Emissions for Model Training:**
+ Energy Consumption: 7,827.46 kWh <br>
+ Carbon Emissions: 3.21 tCO2e <br>
+
+ ### Input
+ * Input Type(s): Image, Text<br>
+ * Input Format(s): Red, Green, Blue (RGB) image + Prompt (String)
+ * Input Parameters: 2D (image), 1D (text)
+ * Other Properties Related to Input:
+   - Max Input Resolution (Width, Height): 1648, 2048
+   - Min Input Resolution (Width, Height): 1024, 1280
+   - Channel Count: 3
+
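+ For reference, the coordinate transform in `postprocessing.py` assumes the input image is resized to fit, and then centered on, a 1648 × 2048 canvas. Below is a minimal sketch of that resize/padding logic, derived from `transform_bbox_to_original`; the helper name is illustrative and not part of this repository:
+
+ ```python
+ def fit_to_canvas(w, h, target_w=1648, target_h=2048):
+     # Illustrative helper (not part of this repo): mirrors the resize logic
+     # replicated inside transform_bbox_to_original in postprocessing.py.
+     # Downscale (preserving aspect ratio) only when the image exceeds the canvas.
+     if h > target_h:
+         w, h = int(target_h * w / h), target_h
+     if w > target_w:
+         w, h = target_w, int(target_w * h / w)
+     # Center the resized image on the canvas.
+     pad_left = (target_w - w) // 2
+     pad_top = (target_h - h) // 2
+     return w, h, pad_left, pad_top
+ ```
+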
+ ### Output
+ * Output Type: Text<br>
+ * Output Format: String
+ * Output Parameters: 1D
+ * Other Properties Related to Output:
+   - The nemotron-parse output is a string that encodes the text content (formatted or plain) together with bounding boxes and class attributes.<br>
+
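+ For illustration, each element of the raw output stream follows the pattern parsed by `extract_classes_bboxes` in `postprocessing.py`; the coordinates and text below are made-up values:
+
+ ```python
+ from postprocessing import extract_classes_bboxes
+
+ # Hypothetical raw model output with two elements; each element is wrapped in
+ # <x_..><y_..> ... <x_..><y_..><class_..> markers.
+ raw = ("<x_0.05><y_0.04># A Title<x_0.95><y_0.08><class_Title>"
+        "<x_0.05><y_0.10>First paragraph ...<x_0.95><y_0.30><class_Text>")
+ classes, bboxes, texts = extract_classes_bboxes(raw)
+ print(classes)  # ['Title', 'Text']
+ print(bboxes)   # [(0.05, 0.04, 0.95, 0.08), (0.05, 0.1, 0.95, 0.3)]
+ ```
+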
+ Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.<br>
+
+ ## Software Integration:
+
+ Runtime Engine(s): TensorRT-LLM
+
+ Supported Hardware Microarchitecture Compatibility: <br>
+ NVIDIA Hopper/NVIDIA Ampere/NVIDIA Turing<br>
+
+ Supported Operating System(s): Linux<br>
+
+ The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.<br>
+
+ ## Model Version:
+
+ V1.1
+
+ ## Quick Start
+
+ ### Install dependencies
+
+ ```bash
+ pip install -r requirements.txt
+ ```
+
+ ### Usage example
+
+ ```python
+ import torch
+ from PIL import Image, ImageDraw
+ from transformers import AutoModel, AutoProcessor, AutoTokenizer, GenerationConfig
+ from postprocessing import extract_classes_bboxes, transform_bbox_to_original, postprocess_text
+
+ # Load model and processor
+ model_path = "nvidia/NVIDIA-Nemotron-Parse-v1.1"  # Or use a local path
+ device = "cuda:0"
+
+ model = AutoModel.from_pretrained(
+     model_path,
+     trust_remote_code=True,
+     torch_dtype=torch.bfloat16
+ ).to(device).eval()
+ tokenizer = AutoTokenizer.from_pretrained(model_path)
+ processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
+
+ # Load image and build the task prompt
+ image = Image.open("path/to/your/image.jpg")
+ task_prompt = "</s><s><predict_bbox><predict_classes><output_markdown>"
+
+ # Process image
+ inputs = processor(images=[image], text=task_prompt, return_tensors="pt").to(device)
+
+ generation_config = GenerationConfig.from_pretrained(model_path, trust_remote_code=True)
+ # Generate text
+ outputs = model.generate(**inputs, generation_config=generation_config)
+
+ # Decode the generated text
+ generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
+ classes, bboxes, texts = extract_classes_bboxes(generated_text)
+ bboxes = [transform_bbox_to_original(bbox, image.width, image.height) for bbox in bboxes]
+
+ # Specify output formats for postprocessing
+ table_format = 'latex'  # latex | HTML | markdown
+ text_format = 'markdown'  # markdown | plain
+ blank_text_in_figures = False  # remove text inside 'Picture' class
+ texts = [postprocess_text(text, cls=cls, table_format=table_format, text_format=text_format, blank_text_in_figures=blank_text_in_figures) for text, cls in zip(texts, classes)]
+
+ for cl, bb, txt in zip(classes, bboxes, texts):
+     print(cl, ': ', txt)
+
+ # Visualize the detected regions
+ draw = ImageDraw.Draw(image)
+ for bbox in bboxes:
+     draw.rectangle((bbox[0], bbox[1], bbox[2], bbox[3]), outline="red")
+ ```
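+
+ If you need a single document string rather than per-region text, one simple approach (assuming the model emits regions in reading order) is to join the postprocessed texts:
+
+ ```python
+ # Assumption: `texts` is ordered in reading order, as emitted above.
+ markdown_doc = "\n\n".join(txt for txt in texts if txt.strip())
+ with open("output.md", "w") as f:
+     f.write(markdown_doc)
+ ```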
+
+ ## Training, Testing, and Evaluation Datasets:
+
+ ### Training Dataset
+
+ nemotron-parse is first pre-trained on our internal datasets, which combine human-collected, synthetic, and automatically generated data.
+ Data Modality:
+ * Text
+ * Image<br>
+ Data Collection Method by Dataset: Hybrid: Human, Synthetic, Automated
+ Labeling Method by Dataset: Hybrid: Human, Synthetic, Automated
+
+ ### Testing and Evaluation Dataset:
+
+ nemotron-parse is evaluated on multiple datasets for robustness, including public and internal datasets.
+ Data Collection Method by Dataset: Hybrid: Human, Synthetic, Automated
+ Labeling Method by Dataset: Hybrid: Human, Synthetic, Automated
+
+ ## Inference
+
+ Runtime Engine(s): TensorRT-LLM
+
+ Test Hardware: NVIDIA H100
+
+ ## Ethical Considerations
+
+ NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.
+
+ Please report security vulnerabilities or NVIDIA AI Concerns here.
+
+ **You are responsible for ensuring that your use of NVIDIA AI Models complies with all applicable laws.**
+
+ ## Enterprise Support
+ Get access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).
+
bias.md ADDED
@@ -0,0 +1,6 @@
+ ## Bias
+ | Field | Response |
+ | :---- | :---- |
+ | Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
+ | Measures taken to mitigate against unwanted bias: | Not applicable |
+ | Bias Metric (If Measured): | None |
explainability.md ADDED
@@ -0,0 +1,15 @@
+ ## Explainability
+
+ | Field | Response |
+ | :---- | :---- |
+ | Intended Task/Domain: | Image to text |
+ | Model Type: | Transformer-based vision-encoder-decoder model |
+ | Intended Users: | Generative AI creators working with conversational AI models and image content. |
+ | Output: | Text |
+ | Describe how the model works: | Generates text by predicting the next word or token based on the context provided in the input sequence, using multiple self-attention layers. |
+ | Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
+ | Technical Limitations & Mitigation: | The model demonstrates weakness to alignment-breaking attacks. Users are advised to deploy language model guardrails alongside this model to prevent potentially harmful outputs. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text. |
+ | Verified to have met prescribed NVIDIA quality standards: | Yes |
+ | Performance Metrics: | Accuracy, Throughput, and User-side throughput |
+ | Potential Known Risks: | The model was optimized explicitly for instruction following and as such is more susceptible to prompt injection and jailbreaking in various forms as a result of its instruction tuning. This means that the model should be paired with additional rails or system filtering to limit exposure to instructions from malicious sources -- either directly or indirectly by retrieval (e.g., via visiting a website) -- as they may yield outputs that can lead to harmful, system-level outcomes, up to and including remote code execution in agentic systems, when effective security controls, including guardrails, are not in place. The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable output even if the prompt itself does not include anything explicitly offensive. |
+ | Licensing: | GOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). Use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). Use of the tokenizer included in this model is governed by the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/). |
latex2html.py ADDED
@@ -0,0 +1,448 @@
+ import re
+ from bs4 import BeautifulSoup
+
+ def skip_whitespace(text, i):
+     """Advance index i past any whitespace."""
+     while i < len(text) and text[i].isspace():
+         i += 1
+     return i
+
+ def parse_braced_argument(text, i):
+     """
+     Given text and an index i that should point at an opening '{',
+     return a tuple (argument_content, new_index) where argument_content is the full
+     string inside the balanced braces and new_index is the position just after the matching '}'.
+     """
+     if i >= len(text) or text[i] != '{':
+         raise ValueError("Expected '{' at position {}".format(i))
+     i += 1  # skip the opening brace
+     start = i
+     level = 1
+     while i < len(text) and level > 0:
+         if text[i] == '{':
+             level += 1
+         elif text[i] == '}':
+             level -= 1
+         i += 1
+     if level != 0:
+         raise ValueError("Unbalanced braces starting at position {}".format(start - 1))
+     # The argument content is from start to i-1 (excluding the closing brace)
+     return text[start:i-1], i
+
+ def parse_command(text, i):
+     r"""
+     Parse a \multirow or \multicolumn command starting at index i.
+     This function assumes the command has exactly three braced arguments.
+
+     It processes each argument recursively. For the third argument, after recursive processing,
+     it replaces any unescaped & with \&.
+
+     Returns a tuple (command_text, new_index) where command_text is the reconstructed command.
+     """
+     # Determine which command we have.
+     if text.startswith(r"\multirow", i):
+         command_name = r"\multirow"
+         i += len(r"\multirow")
+     elif text.startswith(r"\multicolumn", i):
+         command_name = r"\multicolumn"
+         i += len(r"\multicolumn")
+     else:
+         raise ValueError("Expected \\multirow or \\multicolumn at position {}".format(i))
+
+     # Skip whitespace between the command name and the first argument.
+     i = skip_whitespace(text, i)
+     args = []
+     # Expect exactly three arguments
+     for arg_index in range(3):
+         if i >= len(text) or text[i] != '{':
+             raise ValueError("Expected '{' for argument {} at position {}".format(arg_index + 1, i))
+         arg_content, i = parse_braced_argument(text, i)
+         # Process the content recursively to catch nested commands
+         processed_arg = clean_multi_cells(arg_content)
+         if arg_index == 2:
+             # For the cell text (third argument), replace any unescaped &
+             processed_arg = re.sub(r'(?<!\\)&', r'\\&', processed_arg)
+         args.append(processed_arg)
+         # Only skip whitespace between arguments, not after the last one.
+         if arg_index < 2:
+             i = skip_whitespace(text, i)
+     # Reconstruct the full command with its three arguments
+     command_text = f"{command_name}{{{args[0]}}}{{{args[1]}}}{{{args[2]}}}"
+     return command_text, i
+
+ def clean_multi_cells(text):
+     r"""
+     Process an arbitrary LaTeX text string and look for occurrences of \multirow or \multicolumn commands.
+     When found, the command is parsed (handling nested braces and nested commands) and its third argument is fixed.
+
+     Returns the processed text.
+     """
+     result = []
+     i = 0
+     while i < len(text):
+         # Find next occurrence of either command.
+         idx_multi = text.find(r"\multirow", i)
+         idx_multiC = text.find(r"\multicolumn", i)
+
+         # Determine the next index among the two (if any)
+         if idx_multi == -1 and idx_multiC == -1:
+             result.append(text[i:])
+             break
+         if idx_multi == -1:
+             next_idx = idx_multiC
+         elif idx_multiC == -1:
+             next_idx = idx_multi
+         else:
+             next_idx = min(idx_multi, idx_multiC)
+
+         # Append text before the command (preserving any whitespace)
+         result.append(text[i:next_idx])
+         # Process the command starting at next_idx
+         command_text, new_index = parse_command(text, next_idx)
+         result.append(command_text)
+         i = new_index
+     return ''.join(result)
+
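+ # Illustrative example (the literals below are made up, not from a test suite):
+ #   clean_multi_cells(r"\multicolumn{2}{c}{A & B} & C")
+ #   returns r"\multicolumn{2}{c}{A \& B} & C"
+ # i.e. the '&' inside the cell body is escaped so it is no longer mistaken
+ # for a column separator when the row is later split on unescaped '&'.
+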
+ def parse_brace(s, pos):
+     """
+     Given a string s and an index pos pointing to an opening '{',
+     returns a tuple (content, new_pos) where content is the string
+     between the matching braces (handling nested braces) and new_pos is
+     the index just after the closing '}'.
+     """
+     if pos >= len(s) or s[pos] != '{':
+         raise ValueError("Expected '{' at position %d" % pos)
+     pos += 1  # skip the opening brace
+     content = ""
+     depth = 1
+     while pos < len(s) and depth:
+         char = s[pos]
+         if char == '{':
+             depth += 1
+             content += char
+         elif char == '}':
+             depth -= 1
+             if depth:
+                 content += char
+         else:
+             content += char
+         pos += 1
+     if depth != 0:
+         raise ValueError("Unmatched '{' in string.")
+     return content, pos
+
+ def parse_command_merge(s, pos):
+     r"""
+     Parse a multirow or multicolumn command starting at s[pos]. If the content
+     of the command contains a nested command, then recursively parse the inner
+     command and merge its parameters with the outer ones. The merging is done
+     so that the outer multirow's parameters (e.g. rowspan and width) are kept
+     while the inner command's parameters (e.g. colspan, alignment) and its innermost
+     content are returned.
+
+     Returns a tuple (merged_dict, new_pos) where merged_dict is a dictionary
+     containing the combined parameters and new_pos is the updated index after
+     parsing the command.
+     """
+     if s.startswith(r"\multirow", pos):
+         newpos = pos + len(r"\multirow")
+         # Parse the three required arguments for multirow: rowspan, width, and content.
+         rowspan, newpos = parse_brace(s, newpos)
+         width, newpos = parse_brace(s, newpos)
+         content, newpos = parse_brace(s, newpos)
+         # Look for a nested command (either \multirow or \multicolumn) in the content.
+         index_mr = content.find(r"\multirow")
+         index_mc = content.find(r"\multicolumn")
+         if index_mr == -1 and index_mc == -1:
+             # No nested command found; return this command's details.
+             return {"rowspan": rowspan.strip(), "width": width.strip(), "content": content.strip()}, newpos
+         else:
+             # At least one nested command is present. Pick the first occurrence.
+             indices = [i for i in (index_mr, index_mc) if i != -1]
+             first_index = min(indices)
+             # Parse the inner (nested) command from within the content.
+             inner, _ = parse_command_merge(content, first_index)
+             # Merge: keep the outer multirow's parameters and add the inner ones.
+             merged = {"rowspan": rowspan.strip(), "width": width.strip()}
+             merged.update(inner)
+             return merged, newpos
+
+     elif s.startswith(r"\multicolumn", pos):
+         newpos = pos + len(r"\multicolumn")
+         # Parse the three arguments for multicolumn: colspan, alignment, and content.
+         colspan, newpos = parse_brace(s, newpos)
+         alignment, newpos = parse_brace(s, newpos)
+         content, newpos = parse_brace(s, newpos)
+         # Look for a nested command in the content.
+         index_mr = content.find(r"\multirow")
+         index_mc = content.find(r"\multicolumn")
+         if index_mr == -1 and index_mc == -1:
+             return {"colspan": colspan.strip(), "alignment": alignment.strip(), "content": content.strip()}, newpos
+         else:
+             indices = [i for i in (index_mr, index_mc) if i != -1]
+             first_index = min(indices)
+             inner, _ = parse_command_merge(content, first_index)
+             merged = {"colspan": colspan.strip(), "alignment": alignment.strip()}
+             merged.update(inner)
+             return merged, newpos
+
+     # Not a recognized command starting at pos.
+     return None, pos
+
+ def extract_merged_commands(s):
+     """
+     Scan through the LaTeX string s and extract merged multirow/multicolumn commands.
+     For each command found, if there is nesting the parser merges the outer and inner
+     parameters so that the final result includes both the rowspan (or width) and the colspan
+     (or alignment) along with the innermost content.
+
+     Returns a list of dictionaries.
+     """
+     pos = 0
+     results = []
+     while pos < len(s):
+         if s[pos] == '\\':
+             res, newpos = parse_command_merge(s, pos)
+             if res is not None:
+                 results.append(res)
+                 pos = newpos
+                 continue
+         pos += 1
+     return results
+
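+ # Illustrative example (made-up input): a \multicolumn nested inside a \multirow
+ # is flattened into one dict that carries both spans:
+ #   extract_merged_commands(r"\multirow{2}{*}{\multicolumn{3}{c}{X}}")
+ #   -> [{'rowspan': '2', 'width': '*', 'colspan': '3', 'alignment': 'c', 'content': 'X'}]
+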
+ def remove_tags(html, tags_to_remove):
+     soup = BeautifulSoup(html, "html.parser")
+     # Loop through the tags to remove
+     for tag_name in tags_to_remove:
+         for tag in soup.find_all(tag_name):
+             # Move the children of the tag to the parent tag
+             tag.unwrap()  # This removes the tag but keeps its contents
+     # Return the modified HTML as a string
+     return str(soup)
+
+ def convert_th_to_td(html):
+     """Replace all th tags with td tags."""
+     soup = BeautifulSoup(html, "html.parser")
+     for th_tag in soup.find_all('th'):
+         th_tag.name = 'td'
+     return str(soup)
+
+ def replace_italic(text):
+     pattern = re.compile(r'(?<!\\)_(.*?)(?<!\\)_')
+
+     def italic_replacer(match):
+         # Get the text inside the underscores.
+         content = match.group(1)
+         # Remove the escape (backslash) from any escaped underscores inside.
+         content = content.replace(r'\_', '_')
+         return f"<i>{content}</i>"
+
+     # Replace all occurrences of the pattern using the replacer function.
+     return pattern.sub(italic_replacer, text)
+
+ def replace_bold(text):
+     pattern = re.compile(r'(?<!\\)\*\*(.*?)(?<!\\)\*\*')
+
+     def bold_replacer(match):
+         content = match.group(1)
+         # Unescape any escaped asterisks within the captured text.
+         content = content.replace(r'\*', '*')
+         return f"<b>{content}</b>"
+
+     return pattern.sub(bold_replacer, text)
+
+ def latex_table_to_html(latex_str, add_head_body=False):
+     # Pattern to match the entire tabular environment
+     table_pattern = r'\\begin{tabular}{([^}]*)}\s*(.*?)\\end{tabular}'
+
+     def process_cell(cell):
+         # Clean up cell content
+         cell = cell.strip()
+
+         out = extract_merged_commands(cell)
+         if len(out) > 0:
+             cell = process_cell(out[0]["content"])["content"]
+             rowspan = int(out[0].get("rowspan", "1"))
+             colspan = int(out[0].get("colspan", "1"))
+             return {
+                 "content": cell,
+                 "colspan": colspan,
+                 "rowspan": rowspan
+             }
+
+         # Replace latex and markdown formatting with HTML tags
+         cell = re.sub(r'\$([^$]*)\$', r'\1', cell)  # Remove math mode
+         cell = re.sub(r'\\textbf{([^}]*)}', r'<b>\1</b>', cell)  # Convert latex bold
+         cell = re.sub(r'\\textit{([^}]*)}', r'<i>\1</i>', cell)  # Convert latex italic
+         cell = replace_italic(cell)
+         cell = replace_bold(cell)
+         cell = cell.replace("\\$", "$").replace("\\%", "%").replace("\\newline", "\n").replace("\\textless", "<").replace("\\textgreater", ">").replace("\\*", "*").replace("\\_", "_").replace("\\backslash", "\\")
+
+         # Replace \& with & in the cell text
+         cell = cell.replace(r'\&', '&')
+         cell = cell.replace('<tbc>', '')
+         # Preserve newlines for downstream row-splitting; clean other tokens
+         cell = cell.replace('\\unknown', '').replace('\\<|unk|\\>', '').replace('<u>', '<underline>').replace('</u>', '</underline>')
+         return {
+             'content': cell,
+             'colspan': 1,
+             'rowspan': 1
+         }
+
+     def split_row(input_string):
+         # Use a regular expression to split on '&' that is not preceded by a backslash
+         return re.split(r'(?<!\\)&', input_string)
+
+     def convert_table(match):
+         # Extract table content
+         format_spec, content = match.groups()
+
+         # Start building HTML table
+         html = ['<table>']
+
+         # Track cells for multirow
+         multirow_tracker = set()
+
+         # Process rows
+         rows = re.split(r'\\\\', content)
+         current_row = 0
+
+         for row in rows:
+             if not row.strip():
+                 continue
+
+             row = row.strip()
+
+             # Skip \hline
+             if '\\hline' in row:
+                 row = row.replace('\\hline', '')
+                 if not row.strip():
+                     continue
+
+             row = clean_multi_cells(row)
+
+             # Process cells
+             cells = split_row(row)
+             processed_cells = [process_cell(cell) for cell in cells]
+
+             # Build per-cell line lists splitting on newline or <br> tokens
+             def split_lines(text):
+                 parts = re.split(r'(?:\n|<br\s*/?>)+', text)
+                 return parts if parts is not None else ['']
+
+             line_lists = [split_lines(cell['content']) for cell in processed_cells]
+             max_lines = max(len(lst) for lst in line_lists) if line_lists else 1
+
+             # Emit one or more rows based on max_lines
+             for line_idx in range(max_lines):
+                 if add_head_body:
+                     if current_row == 0:
+                         html.append(' <thead>')
+                     if current_row == 1:
+                         html.append(' <tbody>')
+                 html.append(' <tr>')
+                 current_col = 0
+
+                 for col_idx, cell in enumerate(processed_cells):
+                     content_segment = line_lists[col_idx][line_idx] if line_idx < len(line_lists[col_idx]) else ''
+
+                     attrs = []
+                     if cell['colspan'] > 1:
+                         attrs.append(f'colspan="{cell["colspan"]}"')
+                     # Only apply original rowspan to the first emitted line
+                     if cell['rowspan'] > 1 and line_idx == 0:
+                         attrs.append(f'rowspan="{cell["rowspan"]}"')
+                         for r in range(current_row + 1, current_row + cell['rowspan']):
+                             for c in range(current_col, current_col + cell['colspan']):
+                                 multirow_tracker.add((r, c))
+
+                     # If this position is covered by a prior rowspan, skip rendering a duplicate cell
+                     if cell['rowspan'] > 1 and line_idx > 0:
+                         current_col += cell['colspan']
+                         continue
+
+                     if (current_row, current_col) in multirow_tracker and content_segment == '' and cell["colspan"] == 1 and cell["rowspan"] == 1:
+                         current_col += cell['colspan']
+                         continue
+
+                     attr_str = ' ' + ' '.join(attrs) if attrs else ''
+                     cell_tag = 'td'
+                     html.append(f' <{cell_tag}{attr_str}>{content_segment}</{cell_tag}>')
+                     current_col += cell['colspan']
+
+                 if add_head_body and current_row == 0:
+                     html.append(' </thead>')
+                 html.append(' </tr>')
+                 current_row += 1
+         if add_head_body:
+             html.append(' </tbody>')
+         html.append('</table>')
+         return '\n'.join(html)
+
+     # Convert all tabular environments in the input
+     return re.sub(table_pattern, convert_table, latex_str, flags=re.DOTALL)
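+
+ # Illustrative example (made-up input): a 2x2 tabular becomes a plain HTML table:
+ #   latex_table_to_html(r"\begin{tabular}{ll} a & b \\ c & d \end{tabular}")
+ # produces a "<table>" with two "<tr>" rows containing four "<td>" cells
+ # ("a", "b", "c", "d").
+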
+ def convert_single_table(table):
+     """
+     Convert a single HTML table to Markdown format.
+
+     Args:
+         table: BeautifulSoup table element
+
+     Returns:
+         str: Markdown table string
+     """
+     markdown_lines = []
+     rows = table.find_all('tr')
+
+     for i, row in enumerate(rows):
+         cells = row.find_all(['td', 'th'])
+         if not cells:
+             continue
+
+         # Convert cells to text, handling nested elements
+         row_data = []
+         for cell in cells:
+             # Get text content, handling nested elements
+             cell_text = cell.get_text(separator=' ', strip=True)
+             # Escape pipe characters
+             cell_text = cell_text.replace('|', '\\|')
+             row_data.append(cell_text)
+
+         # Add row to markdown
+         markdown_lines.append('| ' + ' | '.join(row_data) + ' |')
+
+         # Add separator after header row
+         if i == 0:
+             separator = '| ' + ' | '.join(['---'] * len(cells)) + ' |'
+             markdown_lines.append(separator)
+
+     return '\n'.join(markdown_lines)
+
+ def convert_html_tables_to_markdown(html_content):
+     """
+     Find all HTML tables and convert them to Markdown while preserving all other content.
+
+     Args:
+         html_content (str): HTML content that may contain tables
+
+     Returns:
+         str: HTML content with tables converted to Markdown
+     """
+     soup = BeautifulSoup(html_content, 'html.parser')
+
+     # Find all tables
+     tables = soup.find_all('table')
+
+     if not tables:
+         return html_content  # Return original content unchanged
+
+     # Convert each table to markdown and replace it
+     for table in tables:
+         markdown_table = convert_single_table(table)
+
+         # Create a new element to replace the table
+         replacement = soup.new_string('\n' + markdown_table + '\n')
+         table.replace_with(replacement)
+
+     return str(soup)
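+
+ # Minimal sanity check when the module is run directly (illustrative only; the
+ # HTML snippet below is made up and not part of any test suite).
+ if __name__ == "__main__":
+     sample = "<p>before</p><table><tr><td>a</td><td>b</td></tr></table>"
+     print(convert_html_tables_to_markdown(sample))
+     # Expected: the <table> is replaced by the markdown rows "| a | b |" and
+     # "| --- | --- |", while the surrounding <p> content is preserved.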
postprocessing.py ADDED
@@ -0,0 +1,95 @@
+ import re
+ from latex2html import convert_html_tables_to_markdown, latex_table_to_html
+
+ def extract_classes_bboxes(text: str):
+     _re_extract_class_bbox = re.compile(r'<x_(\d+(?:\.\d+)?)><y_(\d+(?:\.\d+)?)>(.*?)<x_(\d+(?:\.\d+)?)><y_(\d+(?:\.\d+)?)><class_([^>]+)>', re.DOTALL)
+     classes = []
+     bboxes = []
+     texts = []
+     for m in _re_extract_class_bbox.finditer(text):
+         x1, y1, text, x2, y2, cls = m.groups()
+         classes.append(cls)
+         bboxes.append((float(x1), float(y1), float(x2), float(y2)))
+         texts.append(text)
+
+     # TODO: Remove when fixed
+     classes = [
+         "Formula" if cls == "Inline-formula" else cls for cls in classes
+     ]
+     assert "Page-number" not in classes
+
+     return classes, bboxes, texts
+
+ def transform_bbox_to_original(bbox, original_width, original_height, target_w=1648, target_h=2048):
+     # Replicate exact resize logic
+     aspect_ratio = original_width / original_height
+     new_height = original_height
+     new_width = original_width
+
+     if original_height > target_h:
+         new_height = target_h
+         new_width = int(new_height * aspect_ratio)
+
+     if new_width > target_w:
+         new_width = target_w
+         new_height = int(new_width / aspect_ratio)
+
+     resized_width = new_width
+     resized_height = new_height
+
+     # Calculate padding
+     pad_left = (target_w - resized_width) // 2
+     pad_top = (target_h - resized_height) // 2
+
+     # Transform: use the ACTUAL resized dimensions, not the scale
+     # X coords
+     left = ((bbox[0] * target_w) - pad_left) * original_width / resized_width
+     right = ((bbox[2] * target_w) - pad_left) * original_width / resized_width
+
+     # Y coords - using original_height / resized_height directly
+     top = ((bbox[1] * target_h) - pad_top) * original_height / resized_height
+     bottom = ((bbox[3] * target_h) - pad_top) * original_height / resized_height
+
+     return left, top, right, bottom
+
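+ # Worked example (made-up numbers): for an 824x1024 page, no resizing occurs
+ # (both sides fit in 1648x2048), so pad_left = (1648 - 824) // 2 = 412 and
+ # pad_top = (2048 - 1024) // 2 = 512. A normalized bbox (0.5, 0.5, 0.75, 0.75)
+ # then maps back to pixel coordinates (412.0, 512.0, 824.0, 1024.0):
+ #   transform_bbox_to_original((0.5, 0.5, 0.75, 0.75), 824, 1024)
+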
+ def postprocess_text(text, cls='Text', text_format='markdown', table_format='latex', blank_text_in_figures=False):
+     assert text_format in ['markdown', 'plain'], 'Unknown text format. Supported: markdown | plain'
+     assert table_format in ['latex', 'HTML', 'markdown'], 'Unknown table format. Supported: latex | HTML | markdown'
+     if cls != 'Table':
+         if text_format == 'plain':
+             text = convert_mmd_to_plain_text_ours(text)
+     elif table_format == 'HTML':
+         text = latex_table_to_html(text)
+     elif table_format == 'markdown':
+         text = convert_html_tables_to_markdown(latex_table_to_html(text))
+     if blank_text_in_figures and cls == 'Picture':
+         text = ''
+     return text
+
+ def remove_nemotron_formatting(text):
+     text = text.replace('<tbc>', '')
+     text = text.replace('\\<|unk|\\>', '')
+     text = text.replace('\\unknown', '')
+     return text
+
+ def convert_mmd_to_plain_text_ours(mmd_text):
+     # Convert superscripts/subscripts to ^{...} and _{...} notation
+     mmd_text = re.sub(r'<sup>(.*?)</sup>', r'^{\1}', mmd_text, flags=re.DOTALL)
+     mmd_text = re.sub(r'<sub>(.*?)</sub>', r'_{\1}', mmd_text, flags=re.DOTALL)
+     mmd_text = mmd_text.replace('<br>', '\n')
+
+     # Remove headers (e.g., ##)
+     mmd_text = re.sub(r'#+\s', '', mmd_text)
+
+     # Remove bold (e.g., **)
+     mmd_text = re.sub(r'\*\*(.*?)\*\*', r'\1', mmd_text)
+     # Remove italic (e.g., *)
+     mmd_text = re.sub(r'\*(.*?)\*', r'\1', mmd_text)
+     # Remove emphasized text formatting (e.g., _)
+     mmd_text = re.sub(r'(?<!\w)_([^_]+)_', r'\1', mmd_text)
+
+     # Remove formulas inside paragraphs (e.g., \(R_{ij}(P^{a})=0\))
+     #mmd_text = re.sub(r'\\\((.*?)\\\)', '', mmd_text)
+
+     # Remove asterisk in lists
+     #mmd_text = re.sub(r'^\*\s', '', mmd_text, flags=re.MULTILINE)
+     return mmd_text.strip()
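+
+ # Minimal sanity check when run directly (illustrative input only):
+ if __name__ == "__main__":
+     print(convert_mmd_to_plain_text_ours("## Header with **bold** and x<sup>2</sup>"))
+     # Expected: "Header with bold and x^{2}"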
privacy.md ADDED
@@ -0,0 +1,10 @@
+ | Field | Response |
+ | :---- | :---- |
+ | Generatable or reverse engineerable personal data? | No |
+ | Personal data used to create this model? | No |
+ | Was consent obtained for any personal data used? | Not Applicable |
+ | How often is the dataset reviewed? | Before Release |
+ | Is there provenance for all datasets used in training? | Yes |
+ | Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
+ | Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |
+ | Applicable Privacy Policy | [NVIDIA Privacy Policy](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/) |
safety.md ADDED
@@ -0,0 +1,8 @@
+ | Field | Response |
+ | :---- | :---- |
+ | Model Application Field(s): | Chat, Instruction Following, Chatbot Development, Code Generation, Reasoning, Customer Service |
+ | Describe the life critical impact (if present). | Not Applicable |
+ | Use Case Restrictions: | GOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). Use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). Use of the tokenizer included in this model is governed by the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/). |
+ | Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Dataset access is restricted during training, and dataset license constraints are adhered to. |
+
+ **You are responsible for ensuring that your use of NVIDIA AI Models complies with all applicable laws.**