Upload folder using huggingface_hub
- README.md +178 -0
- bias.md +6 -0
- explainability.md +15 -0
- latex2html.py +448 -0
- postprocessing.py +95 -0
- privacy.md +10 -0
- safety.md +8 -0
README.md
ADDED
@@ -0,0 +1,178 @@

# nemotron-parse Overview

nemotron-parse is a general-purpose text-extraction model designed specifically for documents. Given an image, nemotron-parse extracts formatted text together with bounding boxes and the corresponding semantic classes. This has downstream benefits for several tasks, such as increasing the availability of training data for Large Language Models (LLMs), improving the accuracy of retriever systems, and enhancing document understanding pipelines.

This model is ready for commercial use.

## License
GOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). Use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). Use of the tokenizer included in this model is governed by the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/).

## Deployment Geography:
Global

## Use Case:
nemotron-parse is capable of comprehensive text understanding and document-structure understanding. It will be used in retriever and curator solutions. Its text-extraction datasets and capabilities help with LLM and VLM training, as well as improving run-time inference accuracy of VLMs.
nemotron-parse performs text extraction from PDF and PPT documents. It can classify the objects in a given document (title, section, caption, index, footnote, lists, tables, bibliography, image) and provide bounding boxes with coordinates.

## Release Date:
November 17, 2025

## References
* https://huggingface.co/docs/transformers/en/model_doc/mbart

## Model Architecture

### Architecture Type:
Transformer-based vision-encoder-decoder model

### Network Architecture
* Vision Encoder: ViT-H model (https://huggingface.co/nvidia/C-RADIO)<br>
* Adapter Layer: 1D convolutions & norms to compress dimensionality and sequence length of the latent space (13184 tokens to 3201 tokens)<br>
* Decoder: mBart [1], 10 blocks<br>
* Tokenizer: Use of the tokenizer included in this model is governed by the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/)<br>
* Number of Parameters: < 1B<br>
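
The parameter budget can be sanity-checked once the model is loaded as shown in the Quick Start below (a minimal sketch; `model` refers to the object created there):

```python
# Count parameters of the loaded model (assumes `model` from the Quick Start below).
num_params = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {num_params / 1e9:.2f}B")  # the card states < 1B
```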

## Computational Load (For NVIDIA Models Only)
**Cumulative Compute:** 2.2e+22 <br>
**Estimated Energy and Emissions for Model Training:**
Energy Consumption: 7,827.46 kWh <br>
Carbon Emissions: 3.21 tCO2e <br>

### Input
* Input Type(s): Image, Text<br>
* Input Format(s): Red, Green, Blue (RGB) + Prompt (String)
* Input Parameters: 2D (image), 1D (text)
* Other Properties Related to Input:
  - Max Input Resolution (Width, Height): 1648, 2048
  - Min Input Resolution (Width, Height): 1024, 1280
  - Channel Count: 3

### Output
* Output Type: Text<br>
* Output Format: String
* Output Parameters: 1D
* Other Properties Related to Output:
  - The nemotron-parse output format is a string which encodes the text content (formatted or not) as well as bounding boxes and class attributes.<br>
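
For illustration only, here is a hypothetical raw output string and how `extract_classes_bboxes` from `postprocessing.py` (included in this folder) splits it apart; the text, class names, and coordinate values below are made up, and the coordinates appear to be fractions of the padded model canvas, which is why `transform_bbox_to_original` rescales them by the target resolution:

```python
from postprocessing import extract_classes_bboxes

# Hypothetical output string; text, classes, and coordinates are illustrative only.
sample = (
    "<x_0.12><y_0.05>Document Title<x_0.88><y_0.09><class_Title>"
    "<x_0.10><y_0.12>First paragraph of body text ...<x_0.90><y_0.35><class_Text>"
)
classes, bboxes, texts = extract_classes_bboxes(sample)
# classes -> ['Title', 'Text']
# bboxes  -> [(0.12, 0.05, 0.88, 0.09), (0.1, 0.12, 0.9, 0.35)]
# texts   -> ['Document Title', 'First paragraph of body text ...']
```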

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA’s hardware (e.g. GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.<br>

## Software Integration:

Runtime Engine(s): TensorRT-LLM

Supported Hardware Microarchitecture Compatibility: <br>
NVIDIA Hopper/NVIDIA Ampere/NVIDIA Turing<br>

Supported Operating System(s): Linux<br>

The integration of foundation and fine-tuned models into AI systems requires additional testing using use-case-specific data to ensure safe and effective deployment. Following the V-model methodology, iterative testing and validation at both unit and system levels are essential to mitigate risks, meet technical and functional requirements, and ensure compliance with safety and ethical standards before deployment.<br>

## Model Version:

V1.1

## Quick Start

### Install dependencies

```bash
pip install -r requirements.txt
```

### Usage example

```python
import torch
from PIL import Image, ImageDraw
from transformers import AutoModel, AutoProcessor, AutoTokenizer, AutoConfig, AutoImageProcessor, GenerationConfig
from postprocessing import extract_classes_bboxes, transform_bbox_to_original, postprocess_text

# Load model and processor
model_path = "nvidia/NVIDIA-Nemotron-Parse-v1.1"  # Or use a local path
device = "cuda:0"

model = AutoModel.from_pretrained(
    model_path,
    trust_remote_code=True,
    torch_dtype=torch.bfloat16
).to(device).eval()
tokenizer = AutoTokenizer.from_pretrained(model_path)
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)

# Load image
image = Image.open("path/to/your/image.jpg")
task_prompt = "</s><s><predict_bbox><predict_classes><output_markdown>"

# Process image
inputs = processor(images=[image], text=task_prompt, return_tensors="pt").to(device)
prompt_ids = processor.tokenizer.encode(task_prompt, return_tensors="pt", add_special_tokens=False).cuda()

generation_config = GenerationConfig.from_pretrained(model_path, trust_remote_code=True)
# Generate text
outputs = model.generate(**inputs, generation_config=generation_config)

# Decode the generated text
generated_text = processor.batch_decode(outputs, skip_special_tokens=True)[0]
classes, bboxes, texts = extract_classes_bboxes(generated_text)
bboxes = [transform_bbox_to_original(bbox, image.width, image.height) for bbox in bboxes]

# Specify output formats for postprocessing
table_format = 'latex'  # latex | HTML | markdown
text_format = 'markdown'  # markdown | plain
blank_text_in_figures = False  # remove text inside 'Picture' class
texts = [postprocess_text(text, cls=cls, table_format=table_format, text_format=text_format, blank_text_in_figures=blank_text_in_figures) for text, cls in zip(texts, classes)]

for cl, bb, txt in zip(classes, bboxes, texts):
    print(cl, ': ', txt)

draw = ImageDraw.Draw(image)
for bbox in bboxes:
    draw.rectangle((bbox[0], bbox[1], bbox[2], bbox[3]), outline="red")
```
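
As an optional follow-up to the example above (file names here are hypothetical), the annotated page and the extracted regions can be written to disk:

```python
# Save the visualization and dump each region with its class label to a Markdown file.
image.save("parsed_page_with_boxes.png")
with open("parsed_page.md", "w", encoding="utf-8") as f:
    for cl, txt in zip(classes, texts):
        f.write(f"<!-- {cl} -->\n{txt}\n\n")
```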

## Training, Testing, and Evaluation Datasets:

### Training Dataset

nemotron-parse is first pre-trained on our internal datasets: human, synthetic and automated.
Data Modality:
* Text
* Image<br>

Data Collection Method by Dataset: Hybrid: Human, Synthetic, Automated
Labeling Method by Dataset: Hybrid: Human, Synthetic, Automated

### Testing and Evaluation Dataset:

nemotron-parse is evaluated on multiple datasets for robustness, including public and internal datasets.
Data Collection Method by Dataset: Hybrid: Human, Synthetic, Automated
Labeling Method by Dataset: Hybrid: Human, Synthetic, Automated

## Inference

Runtime Engine(s): TensorRT-LLM

Test Hardware: NVIDIA H100

## Ethical Considerations

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their supporting model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.

**You are responsible for ensuring that your use of NVIDIA AI Models complies with all applicable laws.**

## Enterprise Support
Get access to knowledge base articles and support cases or [submit a ticket](https://www.nvidia.com/en-us/data-center/products/ai-enterprise-suite/support/).

bias.md
ADDED
@@ -0,0 +1,6 @@

## Bias
| Field | Response |
| :---- | :---- |
| Participation considerations from adversely impacted groups [protected classes](https://www.senate.ca.gov/content/protected-classes) in model design and testing: | None |
| Measures taken to mitigate against unwanted bias: | Not applicable |
| Bias Metric (If Measured): | None |

explainability.md
ADDED
@@ -0,0 +1,15 @@

## Explainability

| Field | Response |
| :---- | :---- |
| Intended Task/Domain: | Image to text |
| Model Type: | Transformer-based vision-encoder-decoder model |
| Intended Users: | Generative AI creators working with conversational AI models and image content. |
| Output: | Text |
| Describe how the model works: | Generates text by predicting the next word or token based on the context provided in the input sequence using multiple self-attention layers. |
| Name the adversely impacted groups this has been tested to deliver comparable outcomes regardless of: | Not Applicable |
| Technical Limitations & Mitigation: | The model demonstrates weakness to alignment-breaking attacks. Users are advised to deploy language model guardrails alongside this model to prevent potentially harmful outputs. The model may generate answers that are inaccurate, omit key information, or include irrelevant or redundant text. |
| Verified to have met prescribed NVIDIA quality standards: | Yes |
| Performance Metrics: | Accuracy, Throughput, and User-side throughput |
| Potential Known Risks: | The model was optimized explicitly for instruction following and as such is more susceptible to prompt injection and jailbreaking in various forms as a result of its instruction tuning. This means that the model should be paired with additional rails or system filtering to limit exposure to instructions from malicious sources -- either directly or indirectly by retrieval (e.g. via visiting a website) -- as they may yield outputs that can lead to harmful, system-level outcomes up to and including remote code execution in agentic systems when effective security controls including guardrails are not in place. The model was trained on data that contains toxic language and societal biases originally crawled from the internet. Therefore, the model may amplify those biases and return toxic responses, especially when prompted with toxic prompts. The model may also generate answers that are inaccurate, omit key information, or include irrelevant or redundant text, producing socially unacceptable or undesirable text, even if the prompt itself does not include anything explicitly offensive. |
| Licensing: | GOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). Use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). Use of the tokenizer included in this model is governed by the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/) |

latex2html.py
ADDED
@@ -0,0 +1,448 @@

import re
from bs4 import BeautifulSoup

def skip_whitespace(text, i):
    """Advance index i past any whitespace."""
    while i < len(text) and text[i].isspace():
        i += 1
    return i

def parse_braced_argument(text, i):
    """
    Given text and an index i that should point at an opening '{',
    return a tuple (argument_content, new_index) where argument_content is the full
    string inside the balanced braces and new_index is the position just after the matching '}'.
    """
    if i >= len(text) or text[i] != '{':
        raise ValueError("Expected '{' at position {}".format(i))
    i += 1  # skip the opening brace
    start = i
    level = 1
    while i < len(text) and level > 0:
        if text[i] == '{':
            level += 1
        elif text[i] == '}':
            level -= 1
        i += 1
    if level != 0:
        raise ValueError("Unbalanced braces starting at position {}".format(start - 1))
    # The argument content is from start to i-1 (excluding the closing brace)
    return text[start:i-1], i

def parse_command(text, i):
    """
    Parse a \multirow or \multicolumn command starting at index i.
    This function assumes the command has exactly three braced arguments.

    It processes each argument recursively. For the third argument, after recursive processing,
    it replaces any unescaped & with \&.

    Returns a tuple (command_text, new_index) where command_text is the reconstructed command.
    """
    # Determine which command we have.
    if text.startswith(r"\multirow", i):
        command_name = r"\multirow"
        i += len(r"\multirow")
    elif text.startswith(r"\multicolumn", i):
        command_name = r"\multicolumn"
        i += len(r"\multicolumn")
    else:
        raise ValueError("Expected \\multirow or \\multicolumn at position {}".format(i))

    # Skip whitespace between the command name and the first argument.
    i = skip_whitespace(text, i)
    args = []
    # Expect exactly three arguments
    for arg_index in range(3):
        if i >= len(text) or text[i] != '{':
            raise ValueError("Expected '{' for argument {} at position {}".format(arg_index + 1, i))
        arg_content, i = parse_braced_argument(text, i)
        # Process the content recursively to catch nested commands
        processed_arg = clean_multi_cells(arg_content)
        if arg_index == 2:
            # For the cell text (third argument), replace any unescaped &
            processed_arg = re.sub(r'(?<!\\)&', r'\\&', processed_arg)
        args.append(processed_arg)
        # Only skip whitespace between arguments, not after the last one.
        if arg_index < 2:
            i = skip_whitespace(text, i)
    # Reconstruct the full command with its three arguments
    command_text = f"{command_name}{{{args[0]}}}{{{args[1]}}}{{{args[2]}}}"
    return command_text, i

def clean_multi_cells(text):
    """
    Process an arbitrary LaTeX text string and look for occurrences of \multirow or \multicolumn commands.
    When found, the command is parsed (handling nested braces and nested commands) and its third argument is fixed.

    Returns the processed text.
    """
    result = []
    i = 0
    while i < len(text):
        # Find next occurrence of either command.
        idx_multi = text.find(r"\multirow", i)
        idx_multiC = text.find(r"\multicolumn", i)

        # Determine the next index among the two (if any)
        if idx_multi == -1 and idx_multiC == -1:
            result.append(text[i:])
            break
        if idx_multi == -1:
            next_idx = idx_multiC
        elif idx_multiC == -1:
            next_idx = idx_multi
        else:
            next_idx = min(idx_multi, idx_multiC)

        # Append text before the command (preserving any whitespace)
        result.append(text[i:next_idx])
        # Process the command starting at next_idx
        command_text, new_index = parse_command(text, next_idx)
        result.append(command_text)
        i = new_index
    return ''.join(result)

def parse_brace(s, pos):
    """
    Given a string s and an index pos pointing to an opening '{',
    returns a tuple (content, new_pos) where content is the string
    between the matching braces (handling nested braces) and new_pos is
    the index just after the closing '}'.
    """
    if pos >= len(s) or s[pos] != '{':
        raise ValueError("Expected '{' at position %d" % pos)
    pos += 1  # skip the opening brace
    content = ""
    depth = 1
    while pos < len(s) and depth:
        char = s[pos]
        if char == '{':
            depth += 1
            content += char
        elif char == '}':
            depth -= 1
            if depth:
                content += char
        else:
            content += char
        pos += 1
    if depth != 0:
        raise ValueError("Unmatched '{' in string.")
    return content, pos

def parse_command_merge(s, pos):
    """
    Parse a multirow or multicolumn command starting at s[pos]. If the content
    of the command contains a nested command, then recursively parse the inner
    command and merge its parameters with the outer ones. The merging is done
    so that the outer multirow's parameters (e.g. rowspan and width) are kept
    while the inner command's parameters (e.g. colspan, alignment) and its innermost
    content are returned.

    Returns a tuple (merged_dict, new_pos) where merged_dict is a dictionary
    containing the combined parameters and new_pos is the updated index after
    parsing the command.
    """
    if s.startswith(r"\multirow", pos):
        newpos = pos + len(r"\multirow")
        # Parse the three required arguments for multirow: rowspan, width, and content.
        rowspan, newpos = parse_brace(s, newpos)
        width, newpos = parse_brace(s, newpos)
        content, newpos = parse_brace(s, newpos)
        # Look for a nested command (either \multirow or \multicolumn) in the content.
        index_mr = content.find(r"\multirow")
        index_mc = content.find(r"\multicolumn")
        if index_mr == -1 and index_mc == -1:
            # No nested command found; return this command's details.
            return {"rowspan": rowspan.strip(), "width": width.strip(), "content": content.strip()}, newpos
        else:
            # At least one nested command is present. Pick the first occurrence.
            indices = [i for i in (index_mr, index_mc) if i != -1]
            first_index = min(indices)
            # Parse the inner (nested) command from within the content.
            inner, _ = parse_command_merge(content, first_index)
            # Merge: keep the outer multirow's parameters and add the inner ones.
            merged = {"rowspan": rowspan.strip(), "width": width.strip()}
            merged.update(inner)
            return merged, newpos

    elif s.startswith(r"\multicolumn", pos):
        newpos = pos + len(r"\multicolumn")
        # Parse the three arguments for multicolumn: colspan, alignment, and content.
        colspan, newpos = parse_brace(s, newpos)
        alignment, newpos = parse_brace(s, newpos)
        content, newpos = parse_brace(s, newpos)
        # Look for a nested command in the content.
        index_mr = content.find(r"\multirow")
        index_mc = content.find(r"\multicolumn")
        if index_mr == -1 and index_mc == -1:
            return {"colspan": colspan.strip(), "alignment": alignment.strip(), "content": content.strip()}, newpos
        else:
            indices = [i for i in (index_mr, index_mc) if i != -1]
            first_index = min(indices)
            inner, _ = parse_command_merge(content, first_index)
            merged = {"colspan": colspan.strip(), "alignment": alignment.strip()}
            merged.update(inner)
            return merged, newpos

    # Not a recognized command starting at pos.
    return None, pos

def extract_merged_commands(s):
    """
    Scan through the LaTeX string s and extract merged multirow/multicolumn commands.
    For each command found, if there is nesting the parser merges the outer and inner
    parameters so that the final result includes both the rowspan (or width) and the colspan
    (or alignment) along with the innermost content.

    Returns a list of dictionaries.
    """
    pos = 0
    results = []
    while pos < len(s):
        if s[pos] == '\\':
            res, newpos = parse_command_merge(s, pos)
            if res is not None:
                results.append(res)
                pos = newpos
                continue
        pos += 1
    return results

def remove_tags(html, tags_to_remove):
    """Unwrap the given tags in html, keeping their contents."""
    soup = BeautifulSoup(html, "html.parser")
    # Loop through the tags to remove
    for tag_name in tags_to_remove:
        for tag in soup.find_all(tag_name):
            # Move the children of the tag to the parent tag
            tag.unwrap()  # This removes the tag but keeps its contents
    # Return the modified HTML as a string
    return str(soup)

def convert_th_to_td(html):
    """Replace all th tags with td tags."""
    soup = BeautifulSoup(html, "html.parser")
    for th_tag in soup.find_all('th'):
        th_tag.name = 'td'
    return str(soup)

def replace_italic(text):
    pattern = re.compile(r'(?<!\\)_(.*?)(?<!\\)_')

    def italic_replacer(match):
        # Get the text inside the underscores.
        content = match.group(1)
        # Remove the escape (backslash) from any escaped underscores inside.
        content = content.replace(r'\_', '_')
        return f"<i>{content}</i>"

    # Replace all occurrences of the pattern using the replacer function.
    return pattern.sub(italic_replacer, text)


def replace_bold(text):
    pattern = re.compile(r'(?<!\\)\*\*(.*?)(?<!\\)\*\*')

    def bold_replacer(match):
        content = match.group(1)
        # Unescape any escaped asterisks within the captured text.
        content = content.replace(r'\*', '*')
        return f"<b>{content}</b>"

    return pattern.sub(bold_replacer, text)

def latex_table_to_html(latex_str, add_head_body=False):
    """Convert LaTeX tabular environments in latex_str to HTML tables."""
    # Pattern to match the entire tabular environment
    table_pattern = r'\\begin{tabular}{([^}]*)}\s*(.*?)\\end{tabular}'

    def process_cell(cell):
        # Clean up cell content
        cell = cell.strip()

        out = extract_merged_commands(cell)
        if len(out) > 0:
            cell = process_cell(out[0]["content"])["content"]
            rowspan = int(out[0].get("rowspan", "1"))
            colspan = int(out[0].get("colspan", "1"))
            return {
                "content": cell,
                "colspan": colspan,
                "rowspan": rowspan
            }

        # Replace latex and markdown formatting with HTML tags
        cell = re.sub(r'\$([^$]*)\$', r'\1', cell)  # Remove math mode
        cell = re.sub(r'\\textbf{([^}]*)}', r'<b>\1</b>', cell)  # Convert latex bold
        cell = re.sub(r'\\textit{([^}]*)}', r'<i>\1</i>', cell)  # Convert latex italic
        cell = replace_italic(cell)
        cell = replace_bold(cell)
        cell = cell.replace("\\$", "$").replace("\\%", "%").replace("\\newline", "\n").replace("\\textless", "<").replace("\\textgreater", ">").replace("\\*", "*").replace("\\_", "_").replace("\\backslash", "\\")

        # Replace \& with & in the cell text
        cell = cell.replace(r'\&', '&')
        cell = cell.replace('<tbc>', '')
        # Preserve newlines for downstream row-splitting; clean other tokens
        cell = cell.replace('\\unknown', '').replace('\\<|unk|\\>', '').replace('<u>', '<underline>').replace('</u>', '</underline>')
        return {
            'content': cell,
            'colspan': 1,
            'rowspan': 1
        }

    def split_row(input_string):
        # Use a regular expression to split on '&' that is not preceded by a backslash
        return re.split(r'(?<!\\)&', input_string)

    def convert_table(match):
        # Extract table content
        format_spec, content = match.groups()

        # Start building HTML table
        html = ['<table>']

        # Track cells for multirow
        multirow_tracker = set()

        # Process rows
        rows = re.split(r'\\\\', content)
        current_row = 0

        for row in rows:
            if not row.strip():
                continue

            row = row.strip()

            # Skip \hline
            if '\\hline' in row:
                row = row.replace('\\hline', '')
                if not row.strip():
                    continue

            row = clean_multi_cells(row)

            # Process cells
            cells = split_row(row)
            processed_cells = [process_cell(cell) for cell in cells]

            # Build per-cell line lists splitting on newline or <br> tokens
            def split_lines(text):
                parts = re.split(r'(?:\n|<br\s*/?>)+', text)
                return parts if parts is not None else ['']

            line_lists = [split_lines(cell['content']) for cell in processed_cells]
            max_lines = max(len(lst) for lst in line_lists) if line_lists else 1

            # Emit one or more rows based on max_lines
            for line_idx in range(max_lines):
                if add_head_body:
                    if current_row == 0:
                        html.append(' <thead>')
                    if current_row == 1:
                        html.append(' <tbody>')
                html.append(' <tr>')
                current_col = 0

                for col_idx, cell in enumerate(processed_cells):
                    content_segment = line_lists[col_idx][line_idx] if line_idx < len(line_lists[col_idx]) else ''

                    attrs = []
                    if cell['colspan'] > 1:
                        attrs.append(f'colspan="{cell["colspan"]}"')
                    # Only apply original rowspan to the first emitted line
                    if cell['rowspan'] > 1 and line_idx == 0:
                        attrs.append(f'rowspan="{cell["rowspan"]}"')
                        for r in range(current_row + 1, current_row + cell['rowspan']):
                            for c in range(current_col, current_col + cell['colspan']):
                                multirow_tracker.add((r, c))

                    # If this position is covered by a prior rowspan, skip rendering a duplicate cell
                    if cell['rowspan'] > 1 and line_idx > 0:
                        current_col += cell['colspan']
                        continue

                    if (current_row, current_col) in multirow_tracker and content_segment == '' and cell["colspan"] == 1 and cell["rowspan"] == 1:
                        current_col += cell['colspan']
                        continue

                    attr_str = ' ' + ' '.join(attrs) if attrs else ''
                    cell_tag = 'td'
                    html.append(f' <{cell_tag}{attr_str}>{content_segment}</{cell_tag}>')
                    current_col += cell['colspan']

                if add_head_body and current_row == 0:
                    html.append(' </thead>')
                html.append(' </tr>')
                current_row += 1
        if add_head_body:
            html.append(' </tbody>')
        html.append('</table>')
        return '\n'.join(html)

    # Convert all tabular environments in the input
    return re.sub(table_pattern, convert_table, latex_str, flags=re.DOTALL)

def convert_single_table(table):
    """
    Convert a single HTML table to Markdown format.

    Args:
        table: BeautifulSoup table element

    Returns:
        str: Markdown table string
    """
    markdown_lines = []
    rows = table.find_all('tr')

    for i, row in enumerate(rows):
        cells = row.find_all(['td', 'th'])
        if not cells:
            continue

        # Convert cells to text, handling nested elements
        row_data = []
        for cell in cells:
            # Get text content, handling nested elements
            cell_text = cell.get_text(separator=' ', strip=True)
            # Escape pipe characters
            cell_text = cell_text.replace('|', '\\|')
            row_data.append(cell_text)

        # Add row to markdown
        markdown_lines.append('| ' + ' | '.join(row_data) + ' |')

        # Add separator after header row
        if i == 0:
            separator = '| ' + ' | '.join(['---'] * len(cells)) + ' |'
            markdown_lines.append(separator)

    return '\n'.join(markdown_lines)

def convert_html_tables_to_markdown(html_content):
    """
    Find all HTML tables and convert them to Markdown while preserving all other content.

    Args:
        html_content (str): HTML content that may contain tables

    Returns:
        str: HTML content with tables converted to Markdown
    """
    soup = BeautifulSoup(html_content, 'html.parser')

    # Find all tables
    tables = soup.find_all('table')

    if not tables:
        return html_content  # Return original content unchanged

    # Convert each table to markdown and replace it
    for table in tables:
        markdown_table = convert_single_table(table)

        # Create a new element to replace the table
        replacement = soup.new_string('\n' + markdown_table + '\n')
        table.replace_with(replacement)

    return str(soup)

postprocessing.py
ADDED
@@ -0,0 +1,95 @@

import re
from latex2html import convert_html_tables_to_markdown, latex_table_to_html

def extract_classes_bboxes(text: str):
    """Parse raw nemotron-parse output into parallel lists of classes, bounding boxes and texts."""
    _re_extract_class_bbox = re.compile(r'<x_(\d+(?:\.\d+)?)><y_(\d+(?:\.\d+)?)>(.*?)<x_(\d+(?:\.\d+)?)><y_(\d+(?:\.\d+)?)><class_([^>]+)>', re.DOTALL)
    classes = []
    bboxes = []
    texts = []
    for m in _re_extract_class_bbox.finditer(text):
        x1, y1, txt, x2, y2, cls = m.groups()
        classes.append(cls)
        bboxes.append((float(x1), float(y1), float(x2), float(y2)))
        texts.append(txt)

    # TODO: Remove when fixed
    classes = [
        "Formula" if cls == "Inline-formula" else cls for cls in classes
    ]
    assert "Page-number" not in classes

    return classes, bboxes, texts

def transform_bbox_to_original(bbox, original_width, original_height, target_w=1648, target_h=2048):
    """Map a bbox given in normalized coordinates of the padded (target_w, target_h) canvas back to pixel coordinates of the original image."""
    # Replicate exact resize logic
    aspect_ratio = original_width / original_height
    new_height = original_height
    new_width = original_width

    if original_height > target_h:
        new_height = target_h
        new_width = int(new_height * aspect_ratio)

    if new_width > target_w:
        new_width = target_w
        new_height = int(new_width / aspect_ratio)

    resized_width = new_width
    resized_height = new_height

    # Calculate padding
    pad_left = (target_w - resized_width) // 2
    pad_top = (target_h - resized_height) // 2

    # Transform: use the ACTUAL resized dimensions, not the scale
    # X coords
    left = ((bbox[0] * target_w) - pad_left) * original_width / resized_width
    right = ((bbox[2] * target_w) - pad_left) * original_width / resized_width

    # Y coords - using original_height / resized_height directly
    top = ((bbox[1] * target_h) - pad_top) * original_height / resized_height
    bottom = ((bbox[3] * target_h) - pad_top) * original_height / resized_height

    return left, top, right, bottom

def postprocess_text(text, cls='Text', text_format='markdown', table_format='latex', blank_text_in_figures=False):
    """Convert a single region's text to the requested text/table output format."""
    assert text_format in ['markdown', 'plain'], 'Unknown text format. Supported: markdown | plain'
    assert table_format in ['latex', 'HTML', 'markdown'], 'Unknown table format. Supported: latex | HTML | markdown'
    if cls != 'Table':
        if text_format == 'plain':
            text = convert_mmd_to_plain_text_ours(text)
    elif table_format == 'HTML':
        text = latex_table_to_html(text)
    elif table_format == 'markdown':
        text = convert_html_tables_to_markdown(latex_table_to_html(text))
    if blank_text_in_figures and cls == 'Picture':
        text = ''
    return text

def remove_nemotron_formatting(text):
    """Strip nemotron-specific control tokens from a text region."""
    text = text.replace('<tbc>', '')
    text = text.replace('\\<|unk|\\>', '')
    text = text.replace('\\unknown', '')
    return text

def convert_mmd_to_plain_text_ours(mmd_text):
    mmd_text = re.sub(r'<sup>(.*?)</sup>', r'^{\1}', mmd_text, flags=re.DOTALL)
    mmd_text = re.sub(r'<sub>(.*?)</sub>', r'_{\1}', mmd_text, flags=re.DOTALL)
    mmd_text = mmd_text.replace('<br>', '\n')

    # Remove headers (e.g., ##)
    mmd_text = re.sub(r'#+\s', '', mmd_text)

    # Remove bold (e.g., **)
    mmd_text = re.sub(r'\*\*(.*?)\*\*', r'\1', mmd_text)
    #mmd_text = mmd_text.replace("**","")
    # Remove italic (e.g., *)
    mmd_text = re.sub(r'\*(.*?)\*', r'\1', mmd_text)
    # Remove emphasized text formatting (e.g., _)
    mmd_text = re.sub(r'(?<!\w)_([^_]+)_', r'\1', mmd_text)

    # Remove formulas inside paragraphs (e.g., \(R_{ij}(P^{a})=0\))
    #mmd_text = re.sub(r'\\\((.*?)\\\)', '', mmd_text)

    # Remove asterisk in lists
    #mmd_text = re.sub(r'^\*\s', '', mmd_text, flags=re.MULTILINE)
    return mmd_text.strip()

privacy.md
ADDED
@@ -0,0 +1,10 @@

| Field | Response |
| :---- | :---- |
| Generatable or reverse engineerable personal data? | No |
| Personal data used to create this model? | No |
| Was consent obtained for any personal data used? | Not Applicable |
| How often is the dataset reviewed? | Before Release |
| Is there provenance for all datasets used in training? | Yes |
| Does data labeling (annotation, metadata) comply with privacy laws? | Yes |
| Is data compliant with data subject requests for data correction or removal, if such a request was made? | No, not possible with externally-sourced data. |
| Applicable Privacy Policy | [NVIDIA Privacy Policy](https://www.nvidia.com/en-us/about-nvidia/privacy-policy/) |

safety.md
ADDED
@@ -0,0 +1,8 @@

| Field | Response |
| :---- | :---- |
| Model Application Field(s): | Chat, Instruction Following, Chatbot Development, Code Generation, Reasoning, Customer Service |
| Describe the life critical impact (if present). | Not Applicable |
| Use Case Restrictions: | GOVERNING TERMS: The NIM container is governed by the [NVIDIA Software License Agreement](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-software-license-agreement/) and [Product-Specific Terms for NVIDIA AI Products](https://www.nvidia.com/en-us/agreements/enterprise-software/product-specific-terms-for-ai-products/). Use of this model is governed by the [NVIDIA Community Model License](https://www.nvidia.com/en-us/agreements/enterprise-software/nvidia-community-models-license/). Use of the tokenizer included in this model is governed by the [CC-BY-4.0 license](https://creativecommons.org/licenses/by/4.0/) |
| Model and dataset restrictions: | The Principle of Least Privilege (PoLP) is applied, limiting access for dataset generation and model development. Restrictions are enforced on dataset access during training, and dataset license constraints are adhered to. |

**You are responsible for ensuring that your use of NVIDIA AI Models complies with all applicable laws.**