Hall of Multimodal OCR VLMs and Demonstrations

Community Article Published October 31, 2025


[1.] Introduction

The landscape of multimodal vision-language models (VLMs) for OCR has evolved rapidly, pushing the boundaries far beyond traditional text extraction. Today, advanced OCR models, many of them fine-tuned from state-of-the-art VLMs, excel not only at recognizing text but also at document retrieval, semantic search, and direct question answering over complex visual documents. With powerful vision backbones and robust language fusion, these models can handle degraded or low-quality scans, interpret intricate elements such as tables, charts, and embedded images, and seamlessly combine textual content with visual cues. This progress lets users interact with documents in more intelligent and holistic ways, unlocking capabilities such as question answering, summarization, and knowledge retrieval across diverse document types and challenging visual scenarios.

Alongside this, we will discuss the latest trends in OCR models, the multilingual support offered by modern OCR systems, their distinctive features, benchmark comparisons between models, transformers-based implementations, and strategies for streamlining transformers compatibility, all of which are covered in the sections below.

[i] Model Implementations: Table-of-Models

The models currently implemented are listed below. Navigate directly to the respective model demos if you're not interested in the rest of the content.

DeepSeek-OCR dots.ocr Nanonets-OCR2-3B Chandra olmOCR-2-7B-1025
granite-docling-258M PaddleOCR-VL Logics-Parsing POINTS-Reader-OCR Qwen3-VL

[2.] Latest Trends and Model Sizes

The latest wave of multimodal OCR models demonstrates a clear trend toward larger, more capable architectures with advanced multilingual and document-understanding abilities. Modern models such as Chandra (9B) and OlmOCR-2 (8B) highlight the growing emphasis on large-scale processing and grounding-based reasoning, achieving benchmark-leading average scores above 82 on the olmOCR benchmark. At the same time, efficient mid-sized models like DeepSeek-OCR (3B), dots.ocr (3B), and Nanonets-OCR2-3B (4B) show that even compact architectures can deliver strong accuracy and versatility, handling tasks such as chart and table parsing, handwriting recognition, and HTML or structured markdown output.

Interestingly, smaller yet specialized models like Granite-Docling-258M and PaddleOCR-VL (0.9B) focus on multilingual robustness and lightweight deployments, supporting over a hundred languages and offering document-to-HTML conversion for legacy or low-quality scans. Across the landscape, the convergence of scalability, multilingual support, and structured output formats reflects a broader trend—models are no longer limited to mere text recognition but are evolving into intelligent systems capable of holistic document understanding, retrieval, and reasoning across diverse visual and linguistic domains.



[3.] Multilingual Support of OCR Models

One of the most transformative developments in modern OCR research is the rapid expansion of multilingual support within vision-language models. Contemporary OCR systems are no longer confined to English text recognition; they now extend to a vast range of global languages, scripts, and cultural document types. Models such as PaddleOCR-VL, supporting 109 languages, and DeepSeek-OCR, which handles around 100 languages, showcase remarkable versatility in cross-lingual document understanding. These models excel at extracting structured content from multilingual sources while accurately interpreting handwritten notes, scanned forms, and even historical manuscripts.

Similarly, Chandra (9B) and Qwen3-VL (9B) demonstrate strong multilingual grounding, performing effectively across 30–40+ languages, including ancient or low-resource scripts. Smaller yet capable models like Granite-Docling-258M and Nanonets-OCR2-3B also provide multilingual flexibility while maintaining efficiency for lightweight applications. Interestingly, while OlmOCR-2 (8B) remains English-only, its high accuracy and batch-optimized design highlight that some models continue to prioritize speed and precision over multilingual diversity.


Overall, the latest generation of OCR VLMs is shifting toward inclusive, cross-lingual document intelligence, bridging linguistic barriers and making global content more accessible than ever before. This multilingual evolution allows OCR models not just to read—but to truly understand—the world’s written information across diverse scripts, contexts, and cultural domains.


[4.] Features of Multimodal OCR Models

Modern multimodal OCR models stand out not only for their multilingual abilities but also for their rich feature sets that go far beyond basic text recognition. Models like Nanonets-OCR2-3B and DeepSeek-OCR exemplify this evolution, offering advanced capabilities such as structured markdown generation, HTML rendering, and the ability to interpret complex document elements like charts, tables, flowcharts, and handwritten notes. These features enable seamless transformation of visual content into structured, machine-readable formats suitable for automation, knowledge retrieval, and document analytics.

Meanwhile, PaddleOCR-VL focuses on chart and table-to-HTML conversion, making it ideal for digitizing older or low-quality documents across 100+ languages. On the other hand, Granite-Docling-258M introduces an innovative DocTags system and prompt-based element control, allowing users to guide OCR outputs dynamically through text instructions—a step toward interactive, controllable document parsing.

Larger vision-language architectures such as dots.ocr, Chandra, and OlmOCR-2 emphasize visual grounding, large-scale batch processing, and image extraction capabilities, blending language reasoning with spatial document understanding. Qwen3-VL, notable for its flexibility, supports all-format outputs and even handles ancient scripts and handwriting, extending OCR beyond contemporary text.


Collectively, these advancements mark a significant shift—multimodal OCR models are no longer passive readers but intelligent document interpreters, capable of reasoning over layout, visuals, and semantics to deliver context-aware and richly structured results.


[5.] OlmOCR-Benchmark: Model Comparisons

The OlmOCR Benchmark provides an insightful evaluation of how leading multimodal OCR models perform across a range of complex document understanding tasks. Among the tested models, Chandra (9B) leads with an impressive average score of 83.1 ±0.9, closely followed by OlmOCR-2 (8B) at 82.3 ±1.1, both showcasing strong grounding capabilities and large-scale document reasoning performance. dots.ocr (3B) also performs remarkably well, achieving an average score of 79.1 ±1.0, balancing efficiency with robust visual-text comprehension.

Mid-range models like DeepSeek-OCR (3B) demonstrate competitive results, scoring 75.4 ±1.0, while maintaining multilingual capabilities across 100 languages—highlighting the balance between cross-lingual understanding and structural accuracy. Although models such as Nanonets-OCR2-3B, PaddleOCR-VL, Granite-Docling-258M, and Qwen3-VL have not yet reported benchmark scores, their qualitative strengths in structured outputs, DocTags, and ancient-text handling make them valuable for specialized OCR applications.

OlmOCR Benchmark on Models

Approach [olmOCR-bench]

OlmOCR-Bench adopts a unique evaluation strategy, treating the assessment as a series of unit tests. For instance, table evaluation involves verifying the relationships between selected cells within a given table. The benchmark uses publicly available PDFs, with annotations generated by a variety of proprietary VLMs. This approach has proven effective for evaluating performance primarily in English-language contexts.
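To make the unit-test idea concrete, here is a minimal sketch of what a table-relationship check might look like. The table representation and helper names are hypothetical and purely illustrative; they do not reproduce the actual olmOCR-Bench schema.

```python
# Hypothetical unit-test-style check in the spirit of olmOCR-Bench:
# assert that the cell to the right of an anchor cell holds the expected value.
def cell_right_of(table, anchor):
    """Return the cell immediately to the right of `anchor`, or None."""
    for row in table:
        for j, cell in enumerate(row):
            if cell == anchor and j + 1 < len(row):
                return row[j + 1]
    return None

def test_table_relation(table, anchor, expected):
    return cell_right_of(table, anchor) == expected

# Example: the OCR output claims "Revenue" is followed by "12.4" in the same row.
extracted = [["Metric", "2024"], ["Revenue", "12.4"], ["Margin", "31%"]]
assert test_table_relation(extracted, "Revenue", "12.4")
```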

Overall, the benchmark underscores a clear trend: larger multimodal OCR models tend to deliver higher accuracy and contextual understanding, while smaller and mid-sized models continue to offer excellent efficiency, multilingual flexibility, and deployment versatility. This balance between scale, capability, and specialization defines the evolving competitive landscape of OCR VLMs in 2025.


[6.] Transformers Implementations and Streamlining Compatibility

Here, we present the transformers implementations and demonstrations for several leading OCR models, including Nanonets-OCR2-3B, dots.ocr, OlmOCR-2-7B-1025, DeepSeek-OCR, and many more available on the Hugging Face Hub. These implementations highlight how modern transformer architectures can be streamlined for better compatibility, performance, and deployment across diverse OCR and document-understanding tasks.

All the demos listed below run smoothly with the transformers release current as of October 30, 2025. If you encounter any issues, please ping us in the discussion.

[i] DeepSeek OCR

Let’s start with one of the latest and most trending models on the Hub — DeepSeek-OCR. It is a state-of-the-art, open-source optical character recognition (OCR) model developed by DeepSeek-AI, built around a novel approach called “Contextual Optical Compression.” Unlike traditional OCR systems that detect and classify individual glyphs, DeepSeek-OCR compresses document images into a compact set of visual tokens using an advanced DeepEncoder, which integrates both local and global attention mechanisms to efficiently process high-resolution inputs. These compressed vision tokens are then decoded by a language model that reconstructs the text while preserving complex document structures—such as tables, lists, and headings—directly into Markdown format.

Note: Previously, inference with DeepSeek-OCR ran smoothly on transformers==4.46.3, but newer versions of transformers caused compatibility issues related to LlamaAttention. These issues have now been identified and resolved, and the model runs smoothly with the latest transformers (v4.57.1) or any compatible version. The current setup uses transformers==4.57.1 and torch==2.6.0+cu124 or above (torch.version.cuda = 12.4), and has been tested successfully on an NVIDIA H200 MIG 3g.71gb device. You can enable or disable specific attention implementations, such as FlashAttention or SDPA, depending on your optimization or standardization requirements, or omit the attention implementation argument entirely.
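As a quick illustration of that last point, the sketch below shows how an attention backend can be selected (or left out) when loading the model. Treat it as an assumption-laden example: whether a given backend works depends on your environment (FlashAttention requires a separate install) and on the model's remote code honoring the argument.

```python
from transformers import AutoModel

# Optionally pick an attention backend; omit the argument to use the default.
model = AutoModel.from_pretrained(
    "strangervisionhf/deepseek-ocr-latest-transformers",
    trust_remote_code=True,
    use_safetensors=True,
    attn_implementation="sdpa",  # or "flash_attention_2"; drop the kwarg to opt out
)
```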

DeepSeek-OCR-Latest-Transformers [Hugging Face Demo]: DeepSeek-OCR Space

DeepSeek-OCR-Demo

| Model | Description | Link |
| --- | --- | --- |
| DeepSeek-OCR-Latest-Transformers | Supports the latest transformers (v4.57.1), compatible with torch 2.6.0+cu124 or above, tested on NVIDIA H200 MIG 3g.71gb | Model (Hugging Face) |
| Community Page | Fix issues and experiment with new things | Stranger Vision HF Page |
Quick Start with Transformers🤗

You can directly use the code below with the implemented fixed model, or start with the same DeepSeek OCR model on Google Colab: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepSeek-OCR-Demo/deepseek_ocr_demo.ipynb

Install the required packages

transformers==4.57.1 # (Latest Version)
huggingface_hub
torchvision
matplotlib
accelerate
easydict
einops
spaces
pillow
gradio
addict
hf_xet
torch
numpy

Demo App

import gradio as gr
import torch
import requests
from transformers import AutoModel, AutoTokenizer
import spaces
from typing import Iterable
import os
import tempfile
from PIL import Image, ImageDraw
import re

css = """
#main-title h1 {
    font-size: 2.3em !important;
}
#output-title h2 {
    font-size: 2.1em !important;
}
"""

print("Determining device...")
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"✅ Using device: {device}")

print("Loading model and tokenizer...")
model_name = "strangervisionhf/deepseek-ocr-latest-transformers" # -> Latest transformers version used for the model. (https://huggingface.co/deepseek-ai/DeepSeek-OCR)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

model = AutoModel.from_pretrained(
    model_name,
    trust_remote_code=True,
    use_safetensors=True,
).to(device).eval()

if device.type == 'cuda':
    model = model.to(torch.bfloat16)

print("✅ Model loaded successfully to device and in eval mode.")

def find_result_image(path):
    for filename in os.listdir(path):
        if "grounding" in filename or "result" in filename:
            try:
                image_path = os.path.join(path, filename)
                return Image.open(image_path)
            except Exception as e:
                print(f"Error opening result image {filename}: {e}")
    return None

@spaces.GPU
def process_ocr_task(image, model_size, task_type, ref_text):
    """
    Processes an image with DeepSeek-OCR. The model is already on the correct device.
    """
    if image is None:
        return "Please upload an image first.", None

    print("✅ Model is already on the designated device.")

    with tempfile.TemporaryDirectory() as output_path:
        # Build the prompt
        if task_type == "Free OCR":
            prompt = "<image>\nFree OCR."
        elif task_type == "Convert to Markdown":
            prompt = "<image>\n<|grounding|>Convert the document to markdown."
        elif task_type == "Parse Figure":
            prompt = "<image>\nParse the figure."
        elif task_type == "Locate Object by Reference":
            if not ref_text or ref_text.strip() == "":
                raise gr.Error("For the 'Locate' task, you must provide the reference text to find!")
            prompt = f"<image>\nLocate <|ref|>{ref_text.strip()}<|/ref|> in the image."
        else:
            prompt = "<image>\nFree OCR."

        temp_image_path = os.path.join(output_path, "temp_image.png")
        image.save(temp_image_path)

        size_configs = {
            "Tiny": {"base_size": 512, "image_size": 512, "crop_mode": False},
            "Small": {"base_size": 640, "image_size": 640, "crop_mode": False},
            "Base": {"base_size": 1024, "image_size": 1024, "crop_mode": False},
            "Large": {"base_size": 1280, "image_size": 1280, "crop_mode": False},
            "Gundam (Recommended)": {"base_size": 1024, "image_size": 640, "crop_mode": True},
        }
        config = size_configs.get(model_size, size_configs["Gundam (Recommended)"])

        print(f"🏃 Running inference with prompt: {prompt}")
        text_result = model.infer(
            tokenizer,
            prompt=prompt,
            image_file=temp_image_path,
            output_path=output_path,
            base_size=config["base_size"],
            image_size=config["image_size"],
            crop_mode=config["crop_mode"],
            save_results=True,
            test_compress=True,
            eval_mode=True,
        )

        print(f"====\n📄 Text Result: {text_result}\n====")

        result_image_pil = None
        pattern = re.compile(r"<\|det\|>\[\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]\]<\|/det\|>")
        matches = list(pattern.finditer(text_result))

        if matches:
            print(f"✅ Found {len(matches)} bounding box(es). Drawing on the original image.")
            image_with_bboxes = image.copy()
            draw = ImageDraw.Draw(image_with_bboxes)
            w, h = image.size

            for match in matches:
                coords_norm = [int(c) for c in match.groups()]
                x1_norm, y1_norm, x2_norm, y2_norm = coords_norm

                x1 = int(x1_norm / 1000 * w)
                y1 = int(y1_norm / 1000 * h)
                x2 = int(x2_norm / 1000 * w)
                y2 = int(y2_norm / 1000 * h)

                draw.rectangle([x1, y1, x2, y2], outline="red", width=3)

            result_image_pil = image_with_bboxes
        else:
            print("⚠️ No bounding box coordinates found in text result. Falling back to search for a result image file.")
            result_image_pil = find_result_image(output_path)

        return text_result, result_image_pil

with gr.Blocks(css=css) as demo:
    gr.Markdown("# **DeepSeek OCR [exp]**", elem_id="main-title")

    
    with gr.Row():
        with gr.Column(scale=1):
            image_input = gr.Image(type="pil", label="Upload Image", sources=["upload", "clipboard"])
            model_size = gr.Dropdown(choices=["Tiny", "Small", "Base", "Large", "Gundam (Recommended)"], value="Large", label="Resolution Size")
            task_type = gr.Dropdown(choices=["Free OCR", "Convert to Markdown", "Parse Figure", "Locate Object by Reference"], value="Convert to Markdown", label="Task Type")
            ref_text_input = gr.Textbox(label="Reference Text (for Locate task)", placeholder="e.g., the teacher, 20-10, a red car...", visible=False)
            submit_btn = gr.Button("Process Image", variant="primary")

        with gr.Column(scale=2):
            output_text = gr.Textbox(label="Output (OCR)", lines=8, show_copy_button=True)
            output_image = gr.Image(label="Layout Detection (If Any)", type="pil")
            
            with gr.Accordion("Note", open=False):
                gr.Markdown("Inference using Huggingface transformers on NVIDIA GPUs. This app is running with transformers version 4.57.1 and torch version 2.6.0.")
                
    def toggle_ref_text_visibility(task):
        return gr.Textbox(visible=True) if task == "Locate Object by Reference" else gr.Textbox(visible=False)

    task_type.change(fn=toggle_ref_text_visibility, inputs=task_type, outputs=ref_text_input)
    submit_btn.click(fn=process_ocr_task, inputs=[image_input, model_size, task_type, ref_text_input], outputs=[output_text, output_image])

if __name__ == "__main__":
    demo.queue(max_size=20).launch(share=True)

[ii] dots.ocr

dots.ocr is a state-of-the-art multilingual document parser that unifies layout detection and content recognition within a single vision-language model powered by a compact 1.7 billion-parameter large language model (LLM). It achieves state-of-the-art performance on benchmarks like OmniDocBench for text, tables, reading order, and formula recognition, rivaling even larger models. Its architecture is notably streamlined compared to conventional, multi-model pipelines, enabling easy task switching via prompt changes. dots.ocr supports multiple layout categories including captions, footnotes, formulas (output as LaTeX), tables (output as HTML), and others (output as Markdown), preserving the original text without translation and maintaining human reading order in its outputs.
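To illustrate the prompt-driven task switching, the sketch below pairs an uploaded image with different instructions using the same chat-message format as the demo further down. The prompt strings themselves are illustrative placeholders, not dots.ocr's official prompt templates.

```python
# Illustrative task prompts (placeholders, not dots.ocr's official prompt set).
TASK_PROMPTS = {
    "layout": "Detect the layout elements on this page and return them as JSON.",
    "markdown": "Extract all content from this page and format it as Markdown.",
    "table": "Extract the table on this page as HTML.",
    "formula": "Extract the formulas on this page as LaTeX.",
}

def build_messages(task: str):
    """Pair the uploaded image with the chosen task instruction."""
    return [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": TASK_PROMPTS[task]},
        ],
    }]
```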

Note: Previously, inference with the model dots.ocr would fail with the following error: Error: Error loading dots-ocr model: Received a NoneType for argument 'video_processor', but a BaseVideoProcessor was expected. This issue occurred with the latest versions of transformers. The problem has now been identified and resolved — the updated configuration and model weights ensure smooth inference with the latest transformers (v4.57.1) or any compatible version. The setup used for validation includes transformers==4.57.1 and torch==2.8.0+cu126 or above. With these updates, dots.ocr now runs reliably across modern transformer environments.

Dots.OCR [Hugging Face Demo]: Multimodal-OCR3

Dots-OCR-Demo

| Model Name | Description | Link |
| --- | --- | --- |
| dots.ocr (Latest Transformers Compatible) | Fixed configuration for transformers v4.57.1, compatible with torch 2.8.0+cu126, ensuring stable inference | Model (Hugging Face) |
| Community Page | Fix issues and experiment with new things | Stranger Vision HF Page |
Quick Start with Transformers🤗

You can directly use the code below with the implemented fixed model, or start with the original dots.ocr model that has undergone code-level modifications. Here is the Colab notebook: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Dots.OCR-Notebook/DotsOCR.ipynb

Install the required packages

flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
transformers-stream-generator
transformers==4.57.1 # (Latest Version)
huggingface_hub
qwen-vl-utils
torchvision
matplotlib
accelerate
requests
einops
spaces
pillow
gradio
hf_xet
torch
numpy
timm
peft
av

Demo App

import os
import sys
import random
import uuid
import json
import time
from threading import Thread
from typing import Iterable
from huggingface_hub import snapshot_download

import gradio as gr
import spaces
import torch
import numpy as np
from PIL import Image
import cv2

from transformers import (
    AutoModelForCausalLM,
    AutoProcessor,
    TextIteratorStreamer,
)

from transformers.image_utils import load_image

css = """
#main-title h1 {
    font-size: 2.3em !important;
}
#output-title h2 {
    font-size: 2.1em !important;
}
"""

MAX_MAX_NEW_TOKENS = 4096
DEFAULT_MAX_NEW_TOKENS = 2048
MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))

# Load Dots.OCR
device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_PATH_D = "strangervisionhf/dots.ocr-base-fix"
processor = AutoProcessor.from_pretrained(MODEL_PATH_D, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH_D,
    attn_implementation="flash_attention_2",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True
).eval()

@spaces.GPU
def generate_image(text: str, image: Image.Image,
                   max_new_tokens: int, temperature: float, top_p: float,
                   top_k: int, repetition_penalty: float):
    """
    Generates responses using the Dots.OCR model for image input.
    Yields raw text and Markdown-formatted text.
    """
    if image is None:
        yield "Please upload an image.", "Please upload an image."
        return

    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": text},
        ]
    }]
    prompt_full = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = processor(
        text=[prompt_full],
        images=[image],
        return_tensors="pt",
        padding=True).to(device)

    streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = {
        **inputs,
        "streamer": streamer,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
    }
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        buffer = buffer.replace("<|im_end|>", "")
        time.sleep(0.01)
        yield buffer, buffer

with gr.Blocks(css=css) as demo:
    gr.Markdown("# **Dots.OCR Demo**", elem_id="main-title")
    with gr.Row():
        with gr.Column(scale=2):
            image_query = gr.Textbox(label="Query Input", placeholder="Enter your query here...")
            image_upload = gr.Image(type="pil", label="Upload Image", height=290)
            image_submit = gr.Button("Submit", variant="primary")
        
            with gr.Accordion("Advanced options", open=False):
                max_new_tokens = gr.Slider(label="Max new tokens", minimum=1, maximum=MAX_MAX_NEW_TOKENS, step=1, value=DEFAULT_MAX_NEW_TOKENS)
                temperature = gr.Slider(label="Temperature", minimum=0.1, maximum=4.0, step=0.1, value=0.7)
                top_p = gr.Slider(label="Top-p (nucleus sampling)", minimum=0.05, maximum=1.0, step=0.05, value=0.9)
                top_k = gr.Slider(label="Top-k", minimum=1, maximum=1000, step=1, value=50)
                repetition_penalty = gr.Slider(label="Repetition penalty", minimum=1.0, maximum=2.0, step=0.05, value=1.1)
                
        with gr.Column(scale=3):
                gr.Markdown("## Output", elem_id="output-title")
                output = gr.Textbox(label="Raw Output Stream", interactive=False, lines=11, show_copy_button=True)
                with gr.Accordion("(Result.md)", open=False):
                    markdown_output = gr.Markdown(label="(Result.Md)")

    image_submit.click(
        fn=generate_image,
        inputs=[image_query, image_upload, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
        outputs=[output, markdown_output]
    )

if __name__ == "__main__":
    demo.queue(max_size=50).launch(mcp_server=True, ssr_mode=False, show_error=True)

[iii] Nanonets-OCR2-3B

Nanonets-OCR2-3B is a powerful, state-of-the-art image-to-markdown OCR model designed to transform complex documents into structured markdown enriched with intelligent content recognition and semantic tagging. Key features include LaTeX equation recognition with inline and display support, intelligent image description within img tags, and specialized handling for signatures, watermarks, and form checkboxes using standardized Unicode symbols. The model excels at extracting complex tables in both markdown and HTML formats and can also convert flowcharts and organizational charts into mermaid code. Trained on multilingual and handwritten documents, it supports numerous languages and provides direct answers to visual questions when present in the document. The model is built upon a Qwen2.5-VL-3B-Instruct base.
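Below is an illustrative instruction prompt (not the model's official system prompt) that exercises the features described above; something like it can be passed as the query in the demo that follows.

```python
# Illustrative prompt for Nanonets-OCR2-3B (not the official system prompt).
PROMPT = (
    "Extract the text from the document as structured markdown. "
    "Return equations in LaTeX and tables in HTML. "
    "Describe any embedded images inside <img></img> tags, tag signatures and "
    "watermarks explicitly, and use ☐ / ☑ for form checkboxes."
)
```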

Nanonets-OCR2-3B [Hugging Face Demo]: Multimodal-OCR3

NanonetsOCR2

Quick Start with Transformers🤗

You can directly use the code below with the implemented Nanonets-OCR2-3B model, or start with the same Nanonets-OCR2-3B OCR model on Google Colab: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/c2f6d6f8121a59e80f85a9d3b11219a2f3bb0ed9/Nanonets-OCR2-3B/Nanonets_OCR2_3B.ipynb

Install the required packages

transformers==4.57.1 # (Latest Version)
transformers-stream-generator
huggingface_hub
qwen-vl-utils
torchvision
matplotlib
accelerate
requests
einops
spaces
pillow
gradio
hf_xet
torch
numpy
timm
peft
av

Demo App

import os
import sys
import time
from threading import Thread
from typing import Iterable

import gradio as gr
import spaces
import torch
from PIL import Image

from transformers import (
    Qwen2_5_VLForConditionalGeneration,
    AutoProcessor,
    TextIteratorStreamer,
)

from transformers.image_utils import load_image

css = """
#main-title h1 {
    font-size: 2.3em !important;
}
#output-title h2 {
    font-size: 2.1em !important;
}
"""

# --- Configuration ---
MAX_MAX_NEW_TOKENS = 4096
DEFAULT_MAX_NEW_TOKENS = 2048
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load Nanonets-OCR2-3B Model
MODEL_ID = "nanonets/Nanonets-OCR2-3B"
print(f"Loading model: {MODEL_ID}")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float16
).to(device).eval()
print("Model loaded successfully.")

# --- Generation Function ---
@spaces.GPU
def generate_image(text: str, image: Image.Image,
                   max_new_tokens: int, temperature: float, top_p: float,
                   top_k: int, repetition_penalty: float):
    """
    Generates responses using the Nanonets-OCR2-3B model.
    Yields raw text and Markdown-formatted text.
    """
    if image is None:
        yield "Please upload an image.", "Please upload an image."
        return

    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": text},
        ]
    }]
    prompt_full = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = processor(
        text=[prompt_full],
        images=[image],
        return_tensors="pt",
        padding=True
    ).to(device)

    streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = {
        **inputs,
        "streamer": streamer,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
    }
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        buffer = buffer.replace("<|im_end|>", "")
        time.sleep(0.01)
        yield buffer, buffer

with gr.Blocks(css=css) as demo:
    gr.Markdown("# **Nanonets-OCR2-3B**", elem_id="main-title")
    with gr.Row():
        with gr.Column(scale=2):
            image_query = gr.Textbox(label="Query Input", placeholder="Enter your query here...")
            image_upload = gr.Image(type="pil", label="Upload Image", height=320)

            image_submit = gr.Button("Submit", variant="primary")
            
            with gr.Accordion("Advanced options", open=False):
                max_new_tokens = gr.Slider(label="Max new tokens", minimum=1, maximum=MAX_MAX_NEW_TOKENS, step=1, value=DEFAULT_MAX_NEW_TOKENS)
                temperature = gr.Slider(label="Temperature", minimum=0.1, maximum=4.0, step=0.1, value=0.7)
                top_p = gr.Slider(label="Top-p (nucleus sampling)", minimum=0.05, maximum=1.0, step=0.05, value=0.9)
                top_k = gr.Slider(label="Top-k", minimum=1, maximum=1000, step=1, value=50)
                repetition_penalty = gr.Slider(label="Repetition penalty", minimum=1.0, maximum=2.0, step=0.05, value=1.1)
 
        with gr.Column(scale=3):
            gr.Markdown("## Output", elem_id="output-title")
            output = gr.Textbox(label="Raw Output Stream", interactive=False, lines=15, show_copy_button=True)
            with gr.Accordion("(Result.md)", open=False):
                markdown_output = gr.Markdown(label="(Result.Md)")

    image_submit.click(
        fn=generate_image,
        inputs=[image_query, image_upload, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
        outputs=[output, markdown_output]
    )

if __name__ == "__main__":
    demo.queue(max_size=50).launch(show_error=True)

[iv] Chandra-OCR

Chandra is a highly accurate and advanced OCR model developed by Datalab that converts images and PDFs into richly structured outputs in HTML, Markdown, and JSON formats while preserving detailed document layout information. It supports over 40 languages and excels in handwriting recognition, form reconstruction including checkboxes, and handles complex tables, mathematical formulas, and diverse layouts with high precision. The model extracts images and diagrams along with their captions as structured data, making it suitable for comprehensive document understanding tasks. Chandra offers flexible deployment with two inference modes: local Hugging Face implementation and a remote vLLM server for scalable, high-speed batch processing.

Chandra-OCR [Hugging Face Demo]: Multimodal-OCR3

Chandra

Quick Start with Transformers🤗

You can directly use the code below with the implemented Chandra model, or start with the same Chandra OCR model on Google Colab: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Chandra-OCR/Chandra_OCR.ipynb

Install the required packages

transformers==4.57.1 # (Latest Version)
transformers-stream-generator
huggingface_hub
qwen-vl-utils
torchvision
matplotlib
accelerate
requests
einops
spaces
pillow
gradio
hf_xet
torch
numpy
timm
peft
av

Demo App

import os
import sys
import random
import uuid
import json
import time
from threading import Thread
from typing import Iterable
from huggingface_hub import snapshot_download

import gradio as gr
import spaces
import torch
import numpy as np
from PIL import Image
import cv2

from transformers import (
    Qwen3VLForConditionalGeneration,
    AutoProcessor,
    TextIteratorStreamer,
)

from transformers.image_utils import load_image

css = """
#main-title h1 {
    font-size: 2.3em !important;
}
#output-title h2 {
    font-size: 2.1em !important;
}
"""

MAX_MAX_NEW_TOKENS = 4096
DEFAULT_MAX_NEW_TOKENS = 1024
MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))

# Load Chandra-OCR
device = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_ID_V = "datalab-to/chandra"
processor = AutoProcessor.from_pretrained(MODEL_ID_V, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_ID_V,
    trust_remote_code=True,
    torch_dtype=torch.float16
).to(device).eval()

@spaces.GPU
def generate_image(text: str, image: Image.Image,
                   max_new_tokens: int, temperature: float, top_p: float,
                   top_k: int, repetition_penalty: float):
    """
    Generates responses using the Chandra-OCR model for image input.
    Yields raw text and Markdown-formatted text.
    """
    if image is None:
        yield "Please upload an image.", "Please upload an image."
        return

    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": text},
        ]
    }]
    prompt_full = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = processor(
        text=[prompt_full],
        images=[image],
        return_tensors="pt",
        padding=True).to(device)

    streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)
    generation_kwargs = {
        **inputs,
        "streamer": streamer,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
    }
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        buffer = buffer.replace("<|im_end|>", "")
        time.sleep(0.01)
        yield buffer, buffer

with gr.Blocks(css=css) as demo:
    gr.Markdown("# **Chandra-OCR**", elem_id="main-title")
    with gr.Row():
        with gr.Column(scale=2):
            image_query = gr.Textbox(label="Query Input", placeholder="Enter your query here...")
            image_upload = gr.Image(type="pil", label="Upload Image", height=290)

            image_submit = gr.Button("Submit", variant="primary")
           
            with gr.Accordion("Advanced options", open=False):
                max_new_tokens = gr.Slider(label="Max new tokens", minimum=1, maximum=MAX_MAX_NEW_TOKENS, step=1, value=DEFAULT_MAX_NEW_TOKENS)
                temperature = gr.Slider(label="Temperature", minimum=0.1, maximum=4.0, step=0.1, value=0.7)
                top_p = gr.Slider(label="Top-p (nucleus sampling)", minimum=0.05, maximum=1.0, step=0.05, value=0.9)
                top_k = gr.Slider(label="Top-k", minimum=1, maximum=1000, step=1, value=50)
                repetition_penalty = gr.Slider(label="Repetition penalty", minimum=1.0, maximum=2.0, step=0.05, value=1.1)
                
        with gr.Column(scale=3):
                gr.Markdown("## Output", elem_id="output-title")
                output = gr.Textbox(label="Raw Output Stream", interactive=False, lines=11, show_copy_button=True)
                with gr.Accordion("(Result.md)", open=False):
                    markdown_output = gr.Markdown(label="(Result.Md)")

    image_submit.click(
        fn=generate_image,
        inputs=[image_query, image_upload, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
        outputs=[output, markdown_output]
    )

if __name__ == "__main__":
    demo.queue(max_size=50).launch(mcp_server=True, ssr_mode=False, show_error=True)

[v] olmOCR-2-7B-1025

olmOCR-2-7B-1025 is a high-quality document OCR model built upon the Qwen2.5-VL-7B-Instruct foundation and fine-tuned on the olmOCR-mix-1025 dataset using reinforcement learning with verifiable rewards (GRPO). The model excels in recognizing complex document elements including LaTeX mathematical equations, structured HTML tables, and preserving detailed document structure such as headers, lists, and formatting. It supports natural reading order, multi-column layouts, and metadata for rotation detection, achieving strong accuracy with a score of 82.3 ± 1.1 on the olmOCR bench. Intended primarily for academic and research use, it integrates tightly with the olmOCR toolkit, enabling efficient inference, automatic rendering and retry mechanisms, and batch processing to handle millions of documents at scale. The model expects input images rendered with the longest dimension at 1288 pixels and offers manual prompting options. Licensed under Apache 2.0, olmOCR-2-7B-1025 advances OCR capabilities with a focus on layout preservation, mathematical accuracy, and scalable deployment.
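Since the model expects inputs rendered with the longest dimension at 1288 pixels, a small preprocessing helper like the one below can normalize images before inference. This is a minimal PIL-based sketch, not part of the olmOCR toolkit.

```python
from PIL import Image

def resize_longest_side(image: Image.Image, target: int = 1288) -> Image.Image:
    """Resize so the longest dimension equals `target` pixels, preserving aspect ratio."""
    w, h = image.size
    scale = target / max(w, h)
    return image.resize((round(w * scale), round(h * scale)), Image.LANCZOS)

page = resize_longest_side(Image.open("page.png"))  # hypothetical input file
```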

olmOCR-2-7B-1025 [Hugging Face Demo]: Multimodal-OCR3

OlmOCR2

Quick Start with Transformers🤗

You can directly use the code below with the implemented olmOCR-2-7B-1025 model, or start with the same olmOCR-2-7B-1025 model on Google Colab: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/olmOCR-2-7B-1025/olmOCR_2_7B_1025.ipynb

Install the required packages

transformers==4.57.1 # (Latest Version)
transformers-stream-generator
huggingface_hub
qwen-vl-utils
torchvision
matplotlib
accelerate
requests
einops
spaces
pillow
gradio
hf_xet
torch
numpy
timm
peft
av

Demo App

import os
import sys
import random
import uuid
import json
import time
from threading import Thread
from typing import Iterable

import gradio as gr
import spaces
import torch
from PIL import Image

from transformers import (
    Qwen2_5_VLForConditionalGeneration,
    AutoProcessor,
    TextIteratorStreamer,
)

from transformers.image_utils import load_image

# Custom CSS for styling
css = """
#main-title h1 {
    font-size: 2.3em !important;
}
#output-title h2 {
    font-size: 2.1em !important;
}
"""

# --- Configuration ---
MAX_MAX_NEW_TOKENS = 4096
DEFAULT_MAX_NEW_TOKENS = 2048
MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))

# --- Model Loading ---
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load olmOCR-2-7B-1025
MODEL_ID = "allenai/olmOCR-2-7B-1025"
print(f"Loading model: {MODEL_ID}")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    attn_implementation="flash_attention_2" if torch.cuda.is_available() else "eager"
).to(device).eval()
print("Model loaded successfully.")

@spaces.GPU
def generate_response(text: str, image: Image.Image,
                      max_new_tokens: int, temperature: float, top_p: float,
                      top_k: int, repetition_penalty: float):
    """
    Generates responses using the olmOCR model for the given image and text prompt.
    Yields the generated text in a streaming manner.
    """
    if image is None:
        yield "Please upload an image.", "Please upload an image."
        return

    # Prepare the messages for the chat template
    messages = [{
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": text},
        ]
    }]
    
    prompt_full = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

    inputs = processor(
        text=[prompt_full],
        images=[image],
        return_tensors="pt",
        padding=True
    ).to(device)

    streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)
    
    generation_kwargs = {
        **inputs,
        "streamer": streamer,
        "max_new_tokens": max_new_tokens,
        "do_sample": True,
        "temperature": temperature,
        "top_p": top_p,
        "top_k": top_k,
        "repetition_penalty": repetition_penalty,
    }
    
    # Run generation in a separate thread
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        buffer = buffer.replace("<|im_end|>", "")
        time.sleep(0.01)
        yield buffer, buffer

with gr.Blocks(css=css) as demo:
    gr.Markdown("# **olmOCR-2-7B Demo**", elem_id="main-title")
    gr.Markdown("This interface uses the `allenai/olmOCR-2-7B-1025` model for Optical Character Recognition.")
    
    with gr.Row():
        with gr.Column(scale=2):
            image_query = gr.Textbox(label="Query Input", placeholder="Enter your query here (e.g., 'Transcribe the text')...")
            image_upload = gr.Image(type="pil", label="Upload Image", height=320)

            image_submit = gr.Button("Submit", variant="primary")
            
            with gr.Accordion("Advanced Generation Options", open=False):
                max_new_tokens = gr.Slider(label="Max New Tokens", minimum=1, maximum=MAX_MAX_NEW_TOKENS, step=1, value=DEFAULT_MAX_NEW_TOKENS)
                temperature = gr.Slider(label="Temperature", minimum=0.1, maximum=2.0, step=0.1, value=0.7)
                top_p = gr.Slider(label="Top-p (nucleus sampling)", minimum=0.05, maximum=1.0, step=0.05, value=0.9)
                top_k = gr.Slider(label="Top-k", minimum=1, maximum=1000, step=1, value=50)
                repetition_penalty = gr.Slider(label="Repetition Penalty", minimum=1.0, maximum=2.0, step=0.05, value=1.1)
                
        with gr.Column(scale=3):
            gr.Markdown("## Output", elem_id="output-title")
            output_stream = gr.Textbox(label="Raw Output Stream", interactive=False, lines=15, show_copy_button=True)
            with gr.Accordion("Formatted Markdown Output", open=True):
                markdown_output = gr.Markdown(label="Formatted Result")

    # Connect the submit button to the generation function
    image_submit.click(
        fn=generate_response,
        inputs=[image_query, image_upload, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
        outputs=[output_stream, markdown_output]
    )

if __name__ == "__main__":
    demo.queue(max_size=50).launch(show_error=True)

Note: The models in sections [ii] through [v] (dots.ocr, Nanonets-OCR2-3B, Chandra, and olmOCR-2-7B-1025) are comparatively demonstrated in the Multimodal-OCR3 space.

[vi] granite-docling-258M

IBM Granite-Docling-258M is a lightweight yet powerful vision-language model specifically designed for advanced document conversion and understanding. Unlike traditional OCR tools that often lose structural and contextual document information, Granite-Docling preserves complex layouts, including tables, code blocks, equations, and hierarchical relationships, through a unique markup system called DocTags. This proprietary format encodes document structure, spatial layout, and semantic relationships, enabling highly accurate reconstruction and downstream processing optimized for large language model workflows. The model features a compact architecture with 258 million parameters, making it cost-efficient and suitable for enterprise deployment while delivering high accuracy across tasks such as layout analysis, code recognition, mathematical formula extraction, and table structure recognition. Granite-Docling also extends multilingual support to scripts like Arabic, Chinese, and Japanese. It builds on the earlier SmolDocling model with a new Granite 3 language backbone and SigLIP2 visual encoder, emphasizing stability and robustness. The model integrates seamlessly into IBM’s open-source Docling pipeline and supports output conversion to Markdown, JSON, or HTML formats, positioning it as a next-generation solution for reliable, structured document AI processing.
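As a quick, hedged sketch of the Docling integration, the snippet below uses the docling Python package's DocumentConverter to convert a document and export Markdown. It relies on Docling's default conversion pipeline; wiring Granite-Docling in as the VLM backend requires additional pipeline configuration that is not shown here.

```python
# Minimal sketch with the docling package's default pipeline (pip install docling);
# configuring Granite-Docling as the VLM backend is not shown here.
from docling.document_converter import DocumentConverter

converter = DocumentConverter()
result = converter.convert("sample.pdf")      # hypothetical input; images and URLs also work
print(result.document.export_to_markdown())   # Docling also offers HTML and JSON exports
```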

granite-docling-258M [Hugging Face Demo]: granite-docling-258M

granite-docling-258M demo (Hugging Face Space by ibm-granite)

[vii] PaddleOCR-VL

PaddleOCR-VL is a state-of-the-art, resource-efficient vision-language model specifically designed for document parsing. Its core architecture combines a NaViT-style dynamic high-resolution visual encoder with a lightweight ERNIE-4.5-0.3B language model, enabling accurate recognition of complex document elements such as text, tables, formulas, and charts. Supporting 109 languages, PaddleOCR-VL excels in both page-level document parsing and fine-grained element recognition, including challenging content like handwritten text and historical documents. The model achieves leading performance on public and internal benchmarks, surpassing many existing solutions and even competing with much larger models, all while maintaining fast inference speeds and a low computational footprint. This makes it highly suitable for real-world deployments requiring efficient, multilingual, and comprehensive document understanding.

PaddleOCR-VL [Hugging Face Demo]: PaddleOCR-VL_Online_Demo

PaddleOCR-VL Online Demo (Hugging Face Space by PaddlePaddle)

[viii] Logics-Parsing

Logics-Parsing is a powerful end-to-end document parsing model built on a general Vision-Language Model foundation and enhanced through supervised fine-tuning and reinforcement learning. It excels at analyzing and accurately structuring highly complex documents including scientific papers with multi-column layouts, intricate formulas, chemical structures in SMILES format, and handwriting. The model outputs richly structured HTML preserving document logical structure with tagged content blocks that include categories, bounding boxes, and OCR text. It filters irrelevant elements such as headers and footers automatically to focus on core content. Logics-Parsing achieves state-of-the-art performance on a dedicated in-house benchmark covering diverse document types and STEM content, outperforming many existing tools in normalized edit distance metrics for text, formulas, tables, reading order, chemistry notations, and handwriting. Its single-model architecture simplifies deployment without the need for multi-stage pipelines, making it a robust solution for comprehensive document understanding and parsing tasks.
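Because the model emits tagged HTML blocks, post-processing is straightforward with an HTML parser. The sketch below uses BeautifulSoup (already included in the requirements further down) to walk the top-level blocks; the exact tag and attribute names, such as the bounding-box attribute, are assumptions about the output format rather than a documented schema.

```python
# Hedged sketch: summarize top-level blocks in the generated HTML.
# The "data-bbox" attribute name is an assumption about the output format.
from bs4 import BeautifulSoup

def summarize_blocks(html: str):
    soup = BeautifulSoup(html, "html.parser")
    blocks = []
    for el in soup.find_all(True, recursive=False):   # top-level elements only
        blocks.append({
            "tag": el.name,                            # e.g. p, table, figure, formula
            "bbox": el.get("data-bbox"),               # None if the attribute is absent
            "text": el.get_text(" ", strip=True)[:80], # short text preview
        })
    return blocks
```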

Logics-Parsing [Hugging Face Demo]: VLM-Parsing

VLM-Parsing

Quick Start with Transformers🤗

You can directly use the code below with the implemented Logics-Parsing model, or start with the same Logics-Parsing model on Google Colab: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Logics-Parsing-4bit/Logics_Parsing_4bit.ipynb

Install the required packages

git+https://github.com/Dao-AILab/flash-attention.git
transformers==4.57.1 # (Latest Version)
transformers-stream-generator
gradio_pdf==0.0.22
huggingface_hub
beautifulsoup4
qwen-vl-utils
torchvision
matplotlib
accelerate
html2text
requests
markdown
pymupdf
einops
spaces
pillow
gradio
hf_xet
torch
numpy
timm
peft
fpdf
av

Demo App

import os
import sys
from typing import Iterable, Optional, Tuple, Dict, Any, List
import hashlib
import spaces
import re
import time
import click
import gradio as gr
from io import BytesIO
from PIL import Image
from loguru import logger
from pathlib import Path
import torch
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from transformers.image_utils import load_image
import fitz 
import html2text
import markdown
import tempfile

from gradio.themes import Soft
from gradio.themes.utils import colors, fonts, sizes

pdf_suffixes = [".pdf"]
image_suffixes = [".png", ".jpeg", ".jpg"]
device = "cuda" if torch.cuda.is_available() else "cpu"

logger.info(f"Using device: {device}")

# Model: Logics-Parsing
MODEL_ID = "Logics-MLLM/Logics-Parsing"
logger.info(f"Loading model: {MODEL_ID}")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
    attn_implementation="flash_attention_2" if device == "cuda" else "eager"
).to(device).eval()
logger.info(f"Model '{MODEL_ID}' loaded successfully.")


@spaces.GPU
def parse_page(image: Image.Image) -> str:
    """
    Processes a single image using the Logics-Parsing model to generate structured HTML.
    """
    # Define the prompt for the model
    messages = [{"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Parse this document page into a clean, structured HTML representation. Preserve the logical structure with appropriate tags for content blocks such as paragraphs (<p>), headings (<h1>-<h6>), tables (<table>), figures (<figure>), formulas (<formula>), and others. Include category tags, and filter out irrelevant elements like headers and footers."}
    ]}]

    prompt_full = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(text=prompt_full, images=[image.convert("RGB")], return_tensors="pt").to(device)

    with torch.no_grad():
        generated_ids = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
    
    # Decode the output, skipping the prompt tokens
    generated_ids = generated_ids[:, inputs['input_ids'].shape[1]:]
    output_text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
    return output_text

def convert_file_to_images(file_path: str, dpi: int = 200) -> List[Image.Image]:
    """
    Converts a given file (PDF or image) into a list of PIL Images.
    """
    images = []
    file_ext = Path(file_path).suffix.lower()
    
    if file_ext in image_suffixes:
        images.append(Image.open(file_path).convert("RGB"))
        return images
        
    if file_ext not in pdf_suffixes:
        raise ValueError(f"Unsupported file type: {file_ext}")

    try:
        pdf_document = fitz.open(file_path)
        zoom = dpi / 72.0
        mat = fitz.Matrix(zoom, zoom)
        for page_num in range(len(pdf_document)):
            page = pdf_document.load_page(page_num)
            pix = page.get_pixmap(matrix=mat)
            img_data = pix.tobytes("png")
            images.append(Image.open(BytesIO(img_data)).convert("RGB"))
        pdf_document.close()
    except Exception as e:
        logger.error(f"Failed to convert PDF using PyMuPDF: {e}")
        raise
    return images

def get_initial_state() -> Dict[str, Any]:
    """Returns a dictionary representing the initial state of the app."""
    return {"pages": [], "total_pages": 0, "current_page_index": 0, "page_results": []}

def load_and_preview_file(file_path: Optional[str]) -> Tuple[Optional[Image.Image], str, Dict[str, Any]]:
    """Loads a file, converts it to images, and prepares the initial preview."""
    state = get_initial_state()
    if not file_path:
        return None, '<div class="page-info">No file loaded</div>', state

    try:
        pages = convert_file_to_images(file_path)
        if not pages:
            return None, '<div class="page-info">Could not load file</div>', state
        
        state["pages"] = pages
        state["total_pages"] = len(pages)
        page_info_html = f'<div class="page-info">Page 1 / {state["total_pages"]}</div>'
        return pages[0], page_info_html, state
    except Exception as e:
        logger.error(f"Failed to load and preview file: {e}")
        return None, '<div class="page-info">Failed to load preview</div>', state

async def process_all_pages(state: Dict[str, Any], progress=gr.Progress(track_tqdm=True)):
    """
    Processes all pages in the loaded document, generates HTML and Markdown,
    and returns the results.
    """
    if not state or not state["pages"]:
        error_msg = "<h3>Please upload a file first.</h3>"
        return error_msg, "", "", None, "Error: No file to process", state

    logger.info(f'Processing {state["total_pages"]} pages with model: {MODEL_ID}')
    start_time = time.time()
    
    try:
        page_results = []
        for i, page_img in progress.tqdm(enumerate(state["pages"]), desc="Processing Pages"):
            html_result = parse_page(page_img)
            page_results.append({'raw_html': html_result})
        
        state["page_results"] = page_results
        
        # Combine results from all pages
        full_html_content = "\n\n".join([f'<!-- Page {i+1} -->\n{res["raw_html"]}' for i, res in enumerate(page_results)])
        full_markdown = html2text.html2text(full_html_content)

        # Create a temporary file for download
        with tempfile.NamedTemporaryFile(mode='w', suffix='.md', delete=False, encoding='utf-8') as f:
            f.write(full_markdown)
            md_path = f.name
            
        processing_time = time.time() - start_time
        cost_time_str = f'Total processing time: {processing_time:.2f}s'
        
        # Get outputs for the currently viewed page
        current_page_outputs = get_page_outputs(state)
        
        return *current_page_outputs, md_path, cost_time_str, state

    except Exception as e:
        logger.error(f"Parsing failed: {e}", exc_info=True)
        error_html = f"<h3>An error occurred during processing:</h3><p>{str(e)}</p>"
        return error_html, "", "", None, f"Error: {str(e)}", state

def navigate_page(direction: str, state: Dict[str, Any]):
    """Handles page navigation (previous/next)."""
    if not state or not state["pages"]:
        return None, '<div class="page-info">No file loaded</div>', *get_page_outputs(state), state

    current_index = state["current_page_index"]
    total_pages = state["total_pages"]
    
    if direction == "prev":
        new_index = max(0, current_index - 1)
    elif direction == "next":
        new_index = min(total_pages - 1, current_index + 1)
    else: # Should not happen
        new_index = current_index
        
    state["current_page_index"] = new_index
    
    image_preview = state["pages"][new_index]
    page_info_html = f'<div class="page-info">Page {new_index + 1} / {total_pages}</div>'
    
    page_outputs = get_page_outputs(state)
    
    return image_preview, page_info_html, *page_outputs, state

def get_page_outputs(state: Dict[str, Any]) -> Tuple[str, str, str]:
    """Generates the different output formats for the current page."""
    if not state or not state.get("page_results"):
        return "<h3>Process the document to see results.</h3>", "", ""

    index = state["current_page_index"]
    if index >= len(state["page_results"]):
        return "<h3>Result not available for this page.</h3>", "", ""
        
    result = state["page_results"][index]
    raw_html = result['raw_html']
    
    # Convert HTML to Markdown for source and rendering
    md_source = html2text.html2text(raw_html)
    md_render = markdown.markdown(md_source, extensions=['fenced_code', 'tables'])
    
    return md_render, md_source, raw_html

def clear_all():
    """Resets the entire Gradio interface to its initial state."""
    return None, None, "<h3>Results will be displayed here after processing.</h3>", "", "", None, "", '<div class="page-info">No file loaded</div>', get_initial_state()

@click.command()
def main():
    css = """
    .main-container { max-width: 1400px; margin: 0 auto; }
    .header-text { text-align: center; margin-bottom: 20px; }
    .page-info { text-align: center; padding: 8px 16px; font-weight: bold; margin: 10px 0; }
    """
    with gr.Blocks(css=css, title="Logics-Parsing Demo") as demo:
        app_state = gr.State(value=get_initial_state())

        gr.HTML("""
        <div class="header-text">
            <h1>📄 Logics-Parsing: VLM Document Parser</h1>
            <p style="font-size: 1.1em;">An advanced Vision Language Model to parse documents and images into clean Markdown (via HTML).</p>
            <div style="display: flex; justify-content: center; gap: 20px; margin: 15px 0;">
                <a href="https://huggingface.co/Logics-MLLM/Logics-Parsing" target="_blank" style="text-decoration: none; font-weight: 500;">🤗 Model Info</a>
                <a href="https://github.com/PRITHIVSAKTHIUR/VLM-Parsing" target="_blank" style="text-decoration: none; font-weight: 500;">💻 GitHub</a>
            </div>
        </div>
        """)

        with gr.Row(elem_classes=["main-container"]):
            with gr.Column(scale=1):
                file_input = gr.File(label="Upload PDF or Image", file_types=[".pdf", ".jpg", ".jpeg", ".png"], type="filepath")
                image_preview = gr.Image(label="Preview", type="pil", interactive=False, height=350)
                
                with gr.Row():
                    prev_page_btn = gr.Button("◀ Previous")
                    page_info = gr.HTML('<div class="page-info">No file loaded</div>')
                    next_page_btn = gr.Button("Next ▶")

                with gr.Accordion("Download & Details", open=False):
                    output_file = gr.File(label='Download Markdown Result', interactive=False)
                    cost_time = gr.Textbox(label='Time Cost', interactive=False)

                example_root = "examples"
                if os.path.exists(example_root) and os.path.isdir(example_root):
                    example_files = [os.path.join(example_root, f) for f in os.listdir(example_root) if f.endswith(tuple(pdf_suffixes + image_suffixes))]
                    if example_files:
                        gr.Examples(examples=example_files, inputs=file_input, label="Examples")

                process_btn = gr.Button("🚀 Process Document", variant="primary", size="lg")
                clear_btn = gr.Button("🗑️ Clear All", variant="secondary")
            
            with gr.Column(scale=2):
                with gr.Tabs():
                    with gr.Tab("Rendered Markdown"):
                        md_render_output = gr.Markdown(label='Markdown Rendering')
                    with gr.Tab("Markdown Source"):
                        md_source_output = gr.Code(language="markdown", label="Markdown Source")
                    with gr.Tab("Generated HTML"):
                        raw_html_output = gr.Code(language="html", label="Generated HTML")

        # --- Event Listeners ---
        file_input.change(fn=load_and_preview_file, inputs=file_input, outputs=[image_preview, page_info, app_state], show_progress="full")
        
        process_btn.click(fn=process_all_pages, inputs=[app_state], outputs=[md_render_output, md_source_output, raw_html_output, output_file, cost_time, app_state], show_progress="full")

        prev_page_btn.click(fn=lambda s: navigate_page("prev", s), inputs=app_state, outputs=[image_preview, page_info, md_render_output, md_source_output, raw_html_output, app_state])
        
        next_page_btn.click(fn=lambda s: navigate_page("next", s), inputs=app_state, outputs=[image_preview, page_info, md_render_output, md_source_output, raw_html_output, app_state])

        clear_btn.click(fn=clear_all, outputs=[file_input, image_preview, md_render_output, md_source_output, raw_html_output, output_file, cost_time, page_info, app_state])
        
    demo.queue().launch(debug=True, show_error=True)

if __name__ == '__main__':
    if not os.path.exists("examples"):
        os.makedirs("examples")
        logger.info("Created 'examples' directory. Please add some sample PDF/image files there.")
    main()

[ix] POINTS-Reader-OCR

POINTS-Reader is a streamlined vision-language model from Tencent designed for end-to-end document conversion. It builds on the POINTS1.5 architecture but replaces the larger Qwen2.5-7B-Instruct with the more efficient Qwen2.5-3B-Instruct. The model accepts a fixed prompt and a document image as input and outputs the extracted text directly, with no post-processing required. It supports both Chinese and English documents, achieving competitive accuracy on the OmniDocBench benchmark with scores of 0.133 for English and 0.212 for Chinese, and it emphasizes high throughput by pairing the language model with a moderately sized 600M-parameter NaViT vision encoder.
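
To make this input/output contract concrete, here is a minimal, non-streaming single-image inference sketch that follows the same model.chat pattern used in the full Gradio demo further below; the image path and prompt are illustrative placeholders.

# Minimal single-image inference sketch for tencent/POINTS-Reader
# (same model.chat pattern as the Gradio demo below; the image path is a placeholder).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor

MODEL_PATH = "tencent/POINTS-Reader"
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(MODEL_PATH)

# The chat() helper provided via trust_remote_code expects an image file path.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "sample_document.png"},  # placeholder path
        {"type": "text", "text": "Perform OCR on the image precisely."},
    ],
}]
generation_config = {
    "max_new_tokens": 2048,
    "do_sample": False,
    "repetition_penalty": 1.05,
}
response = model.chat(messages, tokenizer, image_processor, generation_config)
print(response)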

POINTS-Reader-OCR [Hugging Face Demo]: POINTS-Reader-OCR


Quick Start with Transformers🤗

You can use the code below directly with the implemented POINTS-Reader-OCR model, or start from the same implementation in Google Colab: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/tencent-POINTS-Reader/tencent_POINTS_Reader.ipynb

Install the required packages

flash-attn @ https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.6cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
git+https://github.com/WePOINTS/WePOINTS.git
transformers-stream-generator
transformers==4.55.2
huggingface_hub
albumentations
qwen-vl-utils
pyvips-binary
sentencepiece
opencv-python
torch==2.6.0
docling-core
python-docx
torchvision
safetensors
accelerate
matplotlib
num2words
reportlab
requests
pymupdf
hf_xet
spaces
pyvips
pillow
gradio
einops
fpdf
peft
timm
av

Demo App

import spaces
import json
import math
import os
import traceback
from io import BytesIO
from typing import Any, Dict, List, Optional, Tuple
import re
import time
from threading import Thread
import uuid
import tempfile

import gradio as gr
import requests
import torch
from PIL import Image
import fitz
import numpy as np

from transformers import AutoModelForCausalLM, AutoTokenizer, Qwen2VLImageProcessor

from reportlab.lib.pagesizes import A4
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Image as RLImage, Paragraph, Spacer
from reportlab.lib.units import inch

# --- Constants and Model Setup ---
MAX_INPUT_TOKEN_LENGTH = 4096
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

print("CUDA_VISIBLE_DEVICES=", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("torch.__version__ =", torch.__version__)
print("torch.version.cuda =", torch.version.cuda)
print("cuda available:", torch.cuda.is_available())
print("cuda device count:", torch.cuda.device_count())
if torch.cuda.is_available():
    print("current device:", torch.cuda.current_device())
    print("device name:", torch.cuda.get_device_name(torch.cuda.current_device()))

print("Using device:", device)


# --- Model Loading: tencent/POINTS-Reader ---
MODEL_PATH = 'tencent/POINTS-Reader'

print(f"Loading model: {MODEL_PATH}")
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH,
    trust_remote_code=True,
    torch_dtype=torch.float16,
    device_map='auto'
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
image_processor = Qwen2VLImageProcessor.from_pretrained(MODEL_PATH)
print("Model loaded successfully.")


# --- PDF Generation and Preview Utility Function ---
def generate_and_preview_pdf(image: Image.Image, text_content: str, font_size: int, line_spacing: float, alignment: str, image_size: str):
    """
    Generates a PDF, saves it, and then creates image previews of its pages.
    Returns the path to the PDF and a list of paths to the preview images.
    """
    if image is None or not text_content or not text_content.strip():
        raise gr.Error("Cannot generate PDF. Image or text content is missing.")

    # --- 1. Generate the PDF ---
    temp_dir = tempfile.gettempdir()
    pdf_filename = os.path.join(temp_dir, f"output_{uuid.uuid4()}.pdf")
    doc = SimpleDocTemplate(
        pdf_filename,
        pagesize=A4,
        rightMargin=inch, leftMargin=inch,
        topMargin=inch, bottomMargin=inch
    )
    styles = getSampleStyleSheet()
    style_normal = styles["Normal"]
    style_normal.fontSize = int(font_size)
    style_normal.leading = int(font_size) * line_spacing
    style_normal.alignment = {"Left": 0, "Center": 1, "Right": 2, "Justified": 4}[alignment]

    story = []

    img_buffer = BytesIO()
    image.save(img_buffer, format='PNG')
    img_buffer.seek(0)
    
    page_width, _ = A4
    available_width = page_width - 2 * inch
    image_widths = {
        "Small": available_width * 0.3,
        "Medium": available_width * 0.6,
        "Large": available_width * 0.9,
    }
    img_width = image_widths[image_size]
    img = RLImage(img_buffer, width=img_width, height=image.height * (img_width / image.width))
    story.append(img)
    story.append(Spacer(1, 12))

    cleaned_text = re.sub(r'#+\s*', '', text_content).replace("*", "")
    text_paragraphs = cleaned_text.split('\n')
    
    for para in text_paragraphs:
        if para.strip():
            story.append(Paragraph(para, style_normal))

    doc.build(story)

    # --- 2. Render PDF pages as images for preview ---
    preview_images = []
    try:
        pdf_doc = fitz.open(pdf_filename)
        for page_num in range(len(pdf_doc)):
            page = pdf_doc.load_page(page_num)
            pix = page.get_pixmap(dpi=150)
            preview_img_path = os.path.join(temp_dir, f"preview_{uuid.uuid4()}_p{page_num}.png")
            pix.save(preview_img_path)
            preview_images.append(preview_img_path)
        pdf_doc.close()
    except Exception as e:
        print(f"Error generating PDF preview: {e}")
        
    return pdf_filename, preview_images


# --- Core Application Logic ---
@spaces.GPU
def process_document_stream(
    image: Image.Image, 
    prompt_input: str,
    image_scale_factor: float, # New parameter for image scaling
    max_new_tokens: int,
    temperature: float,
    top_p: float,
    top_k: int,
    repetition_penalty: float
):
    """
    Main function that handles model inference using tencent/POINTS-Reader.
    """
    if image is None:
        yield "Please upload an image.", ""
        return
    if not prompt_input or not prompt_input.strip():
        yield "Please enter a prompt.", ""
        return

    # --- IMPLEMENTATION: Image Scaling based on user input ---
    if image_scale_factor > 1.0:
        try:
            original_width, original_height = image.size
            new_width = int(original_width * image_scale_factor)
            new_height = int(original_height * image_scale_factor)
            print(f"Scaling image from {image.size} to ({new_width}, {new_height}) with factor {image_scale_factor}.")
            # Use a high-quality resampling filter for better results
            image = image.resize((new_width, new_height), Image.Resampling.LANCZOS)
        except Exception as e:
            print(f"Error during image scaling: {e}")
            # Continue with the original image if scaling fails
            pass
    # --- END IMPLEMENTATION ---

    temp_image_path = None
    try:
        # --- FIX: Save the PIL Image to a temporary file ---
        # The model expects a file path, not a PIL object.
        temp_dir = tempfile.gettempdir()
        temp_image_path = os.path.join(temp_dir, f"temp_image_{uuid.uuid4()}.png")
        image.save(temp_image_path)
        
        # Prepare content for the model using the temporary file path
        content = [
            dict(type='image', image=temp_image_path),
            dict(type='text', text=prompt_input)
        ]
        messages = [
            {
                'role': 'user',
                'content': content
            }
        ]
        
        # Prepare generation configuration from UI inputs
        generation_config = {
            'max_new_tokens': max_new_tokens,
            'repetition_penalty': repetition_penalty,
            'temperature': temperature,
            'top_p': top_p,
            'top_k': top_k,
            'do_sample': True if temperature > 0 else False
        }

        # Run inference
        response = model.chat(
            messages,
            tokenizer,
            image_processor,
            generation_config
        )
        # Yield the full response at once
        yield response, response

    except Exception as e:
        traceback.print_exc()
        yield f"An error occurred during processing: {str(e)}", ""
    finally:
        # --- Clean up the temporary image file ---
        if temp_image_path and os.path.exists(temp_image_path):
            os.remove(temp_image_path)


# --- Gradio UI Definition ---
def create_gradio_interface():
    """Builds and returns the Gradio web interface."""
    css = """
    .main-container { max-width: 1400px; margin: 0 auto; }
    .process-button { border: none !important; color: white !important; font-weight: bold !important; background-color: blue !important;}
    .process-button:hover { background-color: darkblue !important; transform: translateY(-2px) !important; box-shadow: 0 4px 8px rgba(0,0,0,0.2) !important; }
    #gallery { min-height: 400px; }
    """
    with gr.Blocks(theme="bethecloud/storj_theme", css=css) as demo:
        gr.HTML(f"""
        <div class="title" style="text-align: center">
            <h1>Document Conversion with POINTS Reader 📖</h1>
            <p style="font-size: 1.1em; color: #6b7280; margin-bottom: 0.6em;">
                Using tencent/POINTS-Reader Multimodal for Image Content Extraction
            </p>
        </div>
        """)

        with gr.Row():
            # Left Column (Inputs)
            with gr.Column(scale=1):
                gr.Textbox(
                    label="Model in Use ⚡",
                    value="tencent/POINTS-Reader",
                    interactive=False
                )
                prompt_input = gr.Textbox(
                    label="Query Input",
                    placeholder="✦︎ Enter the prompt",
                    value="Perform OCR on the image precisely.",
                )
                image_input = gr.Image(label="Upload Image", type="pil", sources=['upload'])
                
                with gr.Accordion("Advanced Settings", open=False):
                    # --- NEW UI ELEMENT: Image Scaling Slider ---
                    image_scale_factor = gr.Slider(
                        minimum=1.0, 
                        maximum=3.0, 
                        value=1.0, 
                        step=0.1, 
                        label="Image Upscale Factor",
                        info="Increases image size before processing. Can improve OCR on small text. Default: 1.0 (no change)."
                    )
                    # --- END NEW UI ELEMENT ---
                    max_new_tokens = gr.Slider(minimum=512, maximum=8192, value=2048, step=256, label="Max New Tokens")
                    temperature = gr.Slider(label="Temperature", minimum=0.1, maximum=1.0, step=0.05, value=0.7)
                    top_p = gr.Slider(label="Top-p (nucleus sampling)", minimum=0.05, maximum=1.0, step=0.05, value=0.8)
                    top_k = gr.Slider(label="Top-k", minimum=1, maximum=100, step=1, value=20)
                    repetition_penalty = gr.Slider(label="Repetition penalty", minimum=1.0, maximum=2.0, step=0.05, value=1.05)
                    
                    gr.Markdown("### PDF Export Settings")
                    font_size = gr.Dropdown(choices=["8", "10", "12", "14", "16", "18"], value="12", label="Font Size")
                    line_spacing = gr.Dropdown(choices=[1.0, 1.15, 1.5, 2.0], value=1.15, label="Line Spacing")
                    alignment = gr.Dropdown(choices=["Left", "Center", "Right", "Justified"], value="Justified", label="Text Alignment")
                    image_size = gr.Dropdown(choices=["Small", "Medium", "Large"], value="Medium", label="Image Size in PDF")

                process_btn = gr.Button("🚀 Process Image", variant="primary", elem_classes=["process-button"], size="lg")
                clear_btn = gr.Button("🗑️ Clear All", variant="secondary")

            # Right Column (Outputs)
            with gr.Column(scale=2):
                with gr.Tabs() as tabs:
                    with gr.Tab("📝 Extracted Content"):
                        raw_output_stream = gr.Textbox(label="Raw Model Output (max T ≤ 120s)", interactive=False, lines=15, show_copy_button=True)
                        with gr.Row():
                            examples = gr.Examples(
                                examples=["examples/1.jpeg", 
                                          "examples/2.jpeg", 
                                          "examples/3.jpeg",
                                          "examples/4.jpeg", 
                                          "examples/5.jpeg"],
                                inputs=image_input, label="Examples"
                            )
                        gr.Markdown("[Report-Bug💻](https://huggingface.co/spaces/prithivMLmods/POINTS-Reader-OCR/discussions) | [prithivMLmods🤗](https://huggingface.co/prithivMLmods)")
                    
                    with gr.Tab("📰 README.md"):
                        with gr.Accordion("(Result.md)", open=True): 
                            # --- FIX: Added latex_delimiters to enable LaTeX rendering ---
                            markdown_output = gr.Markdown(latex_delimiters=[
                                {"left": "$$", "right": "$$", "display": True},
                                {"left": "$", "right": "$", "display": False}
                            ])

                    with gr.Tab("📋 PDF Preview"):
                        generate_pdf_btn = gr.Button("📄 Generate PDF & Render", variant="primary")
                        pdf_output_file = gr.File(label="Download Generated PDF", interactive=False)
                        pdf_preview_gallery = gr.Gallery(label="PDF Page Preview", show_label=True, elem_id="gallery", columns=2, object_fit="contain", height="auto")

        # Event Handlers
        def clear_all_outputs():
            return None, "", "Raw output will appear here.", "", None, None

        process_btn.click(
            fn=process_document_stream,
            # --- UPDATE: Add the new slider to the inputs list ---
            inputs=[image_input, prompt_input, image_scale_factor, max_new_tokens, temperature, top_p, top_k, repetition_penalty],
            outputs=[raw_output_stream, markdown_output]
        )
        
        generate_pdf_btn.click(
            fn=generate_and_preview_pdf,
            inputs=[image_input, raw_output_stream, font_size, line_spacing, alignment, image_size],
            outputs=[pdf_output_file, pdf_preview_gallery]
        )

        clear_btn.click(
            clear_all_outputs,
            outputs=[image_input, prompt_input, raw_output_stream, markdown_output, pdf_output_file, pdf_preview_gallery]
        )
    return demo

if __name__ == "__main__":
    demo = create_gradio_interface()
    demo.queue(max_size=50).launch(share=True, show_error=True)

[x] Qwen3-VL

Qwen3-VL is a cutting-edge vision-language model in the Qwen series, tailored for advanced OCR and document parsing tasks. It demonstrates robust recognition of ancient and rare texts and handles diverse handwriting styles effectively. The model supports accurate extraction and seamless reinsertion of images within documents, preserving the original content elements. Architecturally, Qwen3-VL features enhanced visual perception with multi-level vision-transformer fusion and robust positional embeddings, enabling detailed layout parsing and comprehension. It supports a broad set of 32 languages, performs well on low-light, blurred, and tilted documents, and offers superior long-document structure analysis. Qwen3-VL outputs enriched HTML or the specialized QwenVL HTML format with bounding-box metadata for precise layout reconstruction, making it well suited to digitizing complex and historical documents with high fidelity and contextual understanding.
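
To illustrate the basic Transformers workflow before the streaming Gradio demo below, here is a minimal, non-streaming OCR sketch for the smaller Qwen3-VL-2B-Instruct checkpoint used in that demo; the image path and prompt are illustrative placeholders.

# Minimal non-streaming OCR sketch for Qwen/Qwen3-VL-2B-Instruct
# (condensed from the Gradio demo below; the image path is a placeholder).
import torch
from PIL import Image
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

MODEL_ID = "Qwen/Qwen3-VL-2B-Instruct"
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_ID, trust_remote_code=True, torch_dtype=torch.float16
).to(device).eval()

image = Image.open("sample_document.png")  # placeholder path
messages = [{"role": "user", "content": [
    {"type": "image"},
    {"type": "text", "text": "Perform OCR on the image and return Markdown."},
]}]

prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[prompt], images=[image], return_tensors="pt", padding=True).to(device)

with torch.inference_mode():
    output_ids = model.generate(**inputs, max_new_tokens=1024, do_sample=False)

# Drop the prompt tokens and decode only the newly generated answer.
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)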

Qwen3-VL [Hugging Face Demo]: Qwen3-VL-Outpost


Quick Start with Transformers🤗

You can use the code below directly with the implemented Qwen3-VL-2B-Instruct model, or start from the same implementation in Google Colab: https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/Qwen3-VL-2B-Instruct/Qwen3_VL_2B_Instruct.ipynb

Install the required packages

transformers==4.57.1 # latest version at the time of writing
transformers-stream-generator
huggingface_hub
qwen-vl-utils
torchvision
matplotlib
accelerate
reportlab
requests
einops
spaces
pillow
gradio
hf_xet
torch
numpy
timm
peft
av

Demo App

import os
import random
import uuid
import json
import time
import asyncio
from threading import Thread
from typing import Iterable

import gradio as gr
import spaces
import torch
import numpy as np
from PIL import Image
import cv2

from transformers import (
    Qwen3VLForConditionalGeneration,
    AutoTokenizer,
    AutoProcessor,
    TextIteratorStreamer,
)
from transformers.image_utils import load_image
from gradio.themes import Soft
from gradio.themes.utils import colors, fonts, sizes

colors.steel_blue = colors.Color(
    name="steel_blue",
    c50="#EBF3F8",
    c100="#D3E5F0",
    c200="#A8CCE1",
    c300="#7DB3D2",
    c400="#529AC3",
    c500="#4682B4",  
    c600="#3E72A0",
    c700="#36638C",
    c800="#2E5378",
    c900="#264364",
    c950="#1E3450",
)

class SteelBlueTheme(Soft):
    def __init__(
        self,
        *,
        primary_hue: colors.Color | str = colors.gray,
        secondary_hue: colors.Color | str = colors.steel_blue,
        neutral_hue: colors.Color | str = colors.slate,
        text_size: sizes.Size | str = sizes.text_lg,
        font: fonts.Font | str | Iterable[fonts.Font | str] = (
            fonts.GoogleFont("Outfit"), "Arial", "sans-serif",
        ),
        font_mono: fonts.Font | str | Iterable[fonts.Font | str] = (
            fonts.GoogleFont("IBM Plex Mono"), "ui-monospace", "monospace",
        ),
    ):
        super().__init__(
            primary_hue=primary_hue,
            secondary_hue=secondary_hue,
            neutral_hue=neutral_hue,
            text_size=text_size,
            font=font,
            font_mono=font_mono,
        )
        super().set(
            background_fill_primary="*primary_50",
            background_fill_primary_dark="*primary_900",
            body_background_fill="linear-gradient(135deg, *primary_200, *primary_100)",
            body_background_fill_dark="linear-gradient(135deg, *primary_900, *primary_800)",
            button_primary_text_color="white",
            button_primary_text_color_hover="white",
            button_primary_background_fill="linear-gradient(90deg, *secondary_500, *secondary_600)",
            button_primary_background_fill_hover="linear-gradient(90deg, *secondary_600, *secondary_700)",
            button_primary_background_fill_dark="linear-gradient(90deg, *secondary_600, *secondary_800)",
            button_primary_background_fill_hover_dark="linear-gradient(90deg, *secondary_500, *secondary_500)",
            button_secondary_text_color="black",
            button_secondary_text_color_hover="white",
            button_secondary_background_fill="linear-gradient(90deg, *primary_300, *primary_300)",
            button_secondary_background_fill_hover="linear-gradient(90deg, *primary_400, *primary_400)",
            button_secondary_background_fill_dark="linear-gradient(90deg, *primary_500, *primary_600)",
            button_secondary_background_fill_hover_dark="linear-gradient(90deg, *primary_500, *primary_500)",
            slider_color="*secondary_500",
            slider_color_dark="*secondary_600",
            block_title_text_weight="600",
            block_border_width="3px",
            block_shadow="*shadow_drop_lg",
            button_primary_shadow="*shadow_drop_lg",
            button_large_padding="11px",
            color_accent_soft="*primary_100",
            block_label_background_fill="*primary_200",
        )

steel_blue_theme = SteelBlueTheme()

MAX_MAX_NEW_TOKENS = 4096
DEFAULT_MAX_NEW_TOKENS = 1024
MAX_INPUT_TOKEN_LENGTH = int(os.getenv("MAX_INPUT_TOKEN_LENGTH", "4096"))

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

MODEL_ID = "Qwen/Qwen3-VL-2B-Instruct"
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)
model = Qwen3VLForConditionalGeneration.from_pretrained(
    MODEL_ID,
    trust_remote_code=True,
    torch_dtype=torch.float16
).to(device).eval()

def downsample_video(video_path):
    """
    Downsamples the video to evenly spaced frames.
    Each frame is returned as a PIL image along with its timestamp.
    """
    # Initialize before the try block so the finally clause and the final
    # return never reference unbound names if VideoCapture fails early.
    vidcap = None
    frames = []
    try:
        vidcap = cv2.VideoCapture(video_path)
        total_frames = int(vidcap.get(cv2.CAP_PROP_FRAME_COUNT))
        if total_frames <= 0:
            return []
        fps = vidcap.get(cv2.CAP_PROP_FPS)
        # Use a maximum of 10 frames to avoid excessive memory usage
        frame_indices = np.linspace(0, total_frames - 1, min(total_frames, 10), dtype=int)
        for i in frame_indices:
            vidcap.set(cv2.CAP_PROP_POS_FRAMES, i)
            success, image = vidcap.read()
            if success:
                image = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
                pil_image = Image.fromarray(image)
                timestamp = round(i / fps, 2) if fps > 0 else 0.0
                frames.append((pil_image, timestamp))
    finally:
        if vidcap:
            vidcap.release()
    return frames

@spaces.GPU
def generate_image(text: str, image: Image.Image,
                   max_new_tokens: int = 1024,
                   temperature: float = 0.6,
                   top_p: float = 0.9,
                   top_k: int = 50,
                   repetition_penalty: float = 1.2):
    """
    Generates responses using the Qwen3-VL model for image input.
    """
    if image is None:
        yield "Please upload an image.", "Please upload an image."
        return

    messages = [{"role": "user", "content": [{"type": "image"}, {"type": "text", "text": text}]}]
    prompt_full = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(
        text=[prompt_full], images=[image], return_tensors="pt", padding=True).to(device)
    streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)
    
    generation_kwargs = {
        **inputs, 
        "streamer": streamer, 
        "max_new_tokens": max_new_tokens,
        "do_sample": True, 
        "temperature": temperature, 
        "top_p": top_p,
        "top_k": top_k, 
        "repetition_penalty": repetition_penalty,
    }
    
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        time.sleep(0.01)
        yield buffer, buffer

@spaces.GPU
def generate_video(text: str, video_path: str,
                   max_new_tokens: int = 1024,
                   temperature: float = 0.6,
                   top_p: float = 0.9,
                   top_k: int = 50,
                   repetition_penalty: float = 1.2):
    """
    Generates responses using the Qwen3-VL model for video input.
    """
    if video_path is None:
        yield "Please upload a video.", "Please upload a video."
        return

    frames_with_ts = downsample_video(video_path)
    if not frames_with_ts:
        yield "Could not process the video. Please try another file.", "Could not process the video. Please try another file."
        return

    # Prepare messages for the model
    messages = [{"role": "user", "content": [{"type": "text", "text": "These are frames from a video. " + text}]}]
    images_for_processor = []
    for frame, timestamp in frames_with_ts:
        messages[0]["content"].insert(0, {"type": "image"}) # Prepend images
        images_for_processor.append(frame)

    prompt_full = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = processor(
        text=[prompt_full], images=images_for_processor, return_tensors="pt", padding=True).to(device)
    streamer = TextIteratorStreamer(processor, skip_prompt=True, skip_special_tokens=True)
    
    generation_kwargs = {
        **inputs, 
        "streamer": streamer, 
        "max_new_tokens": max_new_tokens,
        "do_sample": True, 
        "temperature": temperature, 
        "top_p": top_p,
        "top_k": top_k, 
        "repetition_penalty": repetition_penalty,
    }
    
    thread = Thread(target=model.generate, kwargs=generation_kwargs)
    thread.start()
    
    buffer = ""
    for new_text in streamer:
        buffer += new_text
        time.sleep(0.01)
        yield buffer, buffer

css = """
#main-title {
    text-align: center;
}
#main-title h1 {
    font-size: 2.5em !important;
    font-weight: 700;
}
#output-title h2 {
    font-size: 2.1em !important;
}
"""

# Create the Gradio Interface
with gr.Blocks(css=css, theme=steel_blue_theme) as demo:
    gr.Markdown("# **Qwen3-VL Outpost**", elem_id="main-title")
    gr.Markdown("### A Gradio interface for the powerful Qwen3-VL-2B-Instruct model.", elem_id="main-title")

    with gr.Row():
        with gr.Column(scale=2):
            with gr.Tabs():
                with gr.TabItem("🖼️ Image Inference"):
                    image_query = gr.Textbox(label="Query Input", placeholder="Enter your query here...")
                    image_upload = gr.Image(type="pil", label="Upload Image", height=320)
                    image_submit = gr.Button("Submit", variant="primary")

                with gr.TabItem("🎬 Video Inference"):
                    video_query = gr.Textbox(label="Query Input", placeholder="Enter your query here...")
                    video_upload = gr.Video(label="Upload Video", height=320)
                    video_submit = gr.Button("Submit", variant="primary")

            with gr.Accordion("Advanced Generation Options", open=False):
                max_new_tokens = gr.Slider(label="Max New Tokens", minimum=1, maximum=MAX_MAX_NEW_TOKENS, step=1, value=DEFAULT_MAX_NEW_TOKENS)
                temperature = gr.Slider(label="Temperature", minimum=0.1, maximum=2.0, step=0.1, value=0.7)
                top_p = gr.Slider(label="Top-P (Nucleus Sampling)", minimum=0.05, maximum=1.0, step=0.05, value=0.9)
                top_k = gr.Slider(label="Top-K", minimum=1, maximum=1000, step=1, value=50)
                repetition_penalty = gr.Slider(label="Repetition Penalty", minimum=1.0, maximum=2.0, step=0.05, value=1.2)

        with gr.Column(scale=3):
            gr.Markdown("## 💡 Model Output", elem_id="output-title")
            output = gr.Textbox(label="Raw Output Stream", interactive=False, lines=13, show_copy_button=True)
            with gr.Accordion("Formatted Markdown Output", open=True):
                markdown_output = gr.Markdown()

    advanced_inputs = [max_new_tokens, temperature, top_p, top_k, repetition_penalty]

    image_submit.click(
        fn=generate_image,
        inputs=[image_query, image_upload] + advanced_inputs,
        outputs=[output, markdown_output]
    )
    
    video_submit.click(
        fn=generate_video,
        inputs=[video_query, video_upload] + advanced_inputs,
        outputs=[output, markdown_output]
    )

if __name__ == "__main__":
    demo.queue(max_size=50).launch(ssr_mode=False, show_error=True)

Note: The LightOnOCR model will be added as soon as it supports Transformers. The PaddleOCR-VL Transformers implementation will be available once Transformers is ready for structured document parsing tasks; for now, PaddleOCR-VL supports only free OCR (plain text extraction). [Hugging Face demo — coming soon.]


[7.] Models in Comparison and Multimodal OCR Spaces

[i] Models in Comparison

The table below lists the models used for comparison; these are also the models referenced in the implementation sections above.

| Model Name | Description | Hugging Face Link |
|---|---|---|
| DeepSeek-OCR | Contextual Optical Compression OCR model with high efficiency and accuracy | https://huggingface.co/deepseek-ai/DeepSeek-OCR |
| dots.ocr | Multilingual document layout parsing with a unified vision-language model | https://huggingface.co/rednote-hilab/dots.ocr |
| Nanonets-OCR2-3B | Image-to-markdown OCR with semantic tagging and multilingual support | https://huggingface.co/nanonets/Nanonets-OCR2-3B |
| Chandra | Layout-preserving OCR model for complex forms and handwriting | https://huggingface.co/datalab-to/chandra |
| olmOCR-2-7B-1025 | Layout-preserving OCR with math and structure recognition | https://huggingface.co/allenai/olmOCR-2-7B-1025 |
| Granite-Docling-258M | Enterprise document AI model with DocTags for structural preservation | https://huggingface.co/ibm-granite/granite-docling-258M |
| PaddleOCR-VL | Lightweight multilingual document parser combining NaViT and the ERNIE language model | https://huggingface.co/PaddlePaddle/PaddleOCR-VL |
| Logics-Parsing | End-to-end document parsing for complex scientific papers | https://huggingface.co/Logics-MLLM/Logics-Parsing |
| POINTS-Reader | Efficient end-to-end document AI with two-stage data augmentation | https://huggingface.co/tencent/POINTS-Reader |
| Qwen3-VL | Vision-language model for OCR, including ancient text and handwriting recognition | https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct |

[ii] Multimodal OCR Spaces

The table below lists the multimodal OCR Spaces and the models included in each.

| Space Name | Models Included | Hugging Face Space Link |
|---|---|---|
| Multimodal-OCR3 | Nanonets-OCR2-3B, Chandra-OCR, Dots.OCR, olmOCR-2-7B-1025 | Multimodal-OCR3 |
| Multimodal-OCR2 | Nanonets-OCR-s, MonkeyOCR-Recognition, Typhoon-OCR-7B, SmolDocling-256M-preview | Multimodal-OCR2 |
| Multimodal-OCR | olmOCR-7B-0725, RolmOCR-7B | Multimodal-OCR |
| DeepSeek-OCR-experimental | DeepSeek-OCR-experimental | DeepSeek-OCR-experimental |
| Qwen-3VL:Multimodal | Qwen3-VL-30B-A3B-Instruct | Qwen3-VL-HF-Demo |
| VLM-Parsing | Logics-Parsing | VLM-Parsing |
| POINTS-Reader-OCR | POINTS-Reader | POINTS-Reader-OCR |
| core-OCR | Camel-Doc-OCR-080125(v2), docscopeOCR-7B-050425-exp, MonkeyOCR-Recognition, coreOCR-7B-050325-preview | core-OCR |

[8.] Acknowledgements

| Resource / Contributor | Description | Link |
|---|---|---|
| Hugging Face Transformers | State-of-the-art pretrained models for inference and training | Hugging Face Transformers |
| Gradio SDK | Fastest way to demo machine learning models with a friendly web interface | Gradio SDK |
| Multimodal Implementations | Comprehensive demo of multimodal VLMs on the Hugging Face Hub | Multimodal Implementations |
| PyTorch | Optimized tensor library for deep learning on GPUs and CPUs | PyTorch |
| Stranger Vision | HF community for model modification and experimentation (team members: @merve, @prithivMLmods) | Stranger Vision |
| About this Blog | Hall of Multimodal OCR VLMs and Demonstrations by @prithivMLmods | Author |
| More Relevant Article | Supercharge your OCR Pipelines with Open Models | OCR Open Models Blog |

[9.] Conclusion

In conclusion, the evolution of multimodal OCR vision-language models represents a definitive paradigm shift, moving far beyond traditional text extraction to a new era of holistic document intelligence. As this exploration has demonstrated, the current landscape is defined by a dynamic interplay between large-scale models like Chandra and olmOCR-2, which set new benchmarks in accuracy and contextual reasoning, and the rise of efficient, specialized models such as DeepSeek-OCR and Granite-Docling, which balance performance with accessibility and deployment flexibility. This new generation of VLMs is marked by transformative multilingual capabilities that overcome linguistic barriers and by a wide range of features that convert complex visual elements, including tables, mathematical equations, handwriting, and charts, into structured, machine-readable formats like Markdown, HTML, and JSON. These advancements are not merely theoretical; the detailed transformer implementations and accessible Hugging Face demonstrations presented throughout this article highlight a major step toward improved compatibility and democratized access for developers and researchers. Ultimately, the convergence of scale, deep structural understanding, and multilingual intelligence is creating an ecosystem where digital documents are no longer static text, but intelligent, interactive sources of knowledge that redefine how we access and interpret the world’s information.

This blog is written on behalf of Stranger Vision HF, an open community where we experiment with multimodal models, fix issues, and make incompatible systems compatible.

Stranger Vision — Here, we modify, repair, and experiment with things.

As the community is still in its early stages, we have made some minor fixes to ensure certain models work with the latest updates. We fixed a Dots.OCR inference issue caused by a NoneType argument in recent Transformers versions, resolved the num_hidden_layers problem in Nanonets-1.5B, and exposed the path for PaddleOCR to enable smoother Transformers inference. We also fixed compatibility issues in DeepSeek-OCR with the latest Transformers version, so inference now runs without errors.
