Gliese-OCR-7B-Post2.0-final

The Gliese-OCR-7B-Post2.0-final model is a refined and optimized version of Gliese-OCR-7B-Post1.0, built upon the Qwen2.5-VL architecture. It represents the final iteration in the Gliese-OCR series, offering enhanced efficiency, precision, and visualization capabilities for document OCR, visual analysis, and information extraction.

Fine-tuned with extended document visualization data and OCR-focused objectives, this model delivers superior accuracy across a wide range of document types, including scanned PDFs, handwritten pages, structured forms, and analytical reports.

Key Enhancements

Optimized Document Visualization and OCR Pipeline: Significantly improved recognition of text, layout, and embedded visuals for structured document understanding.
Context-Aware Multimodal Linking: Enhanced understanding of document context with stronger alignment between text, images, and layout components.
Refined Document Retrieval: Improved retrieval accuracy from complex layouts and multi-page documents.
High-Fidelity Content Extraction: Precise extraction of structured, semi-structured, and unstructured information with advanced text normalization.
Analytical Recognition: Superior reasoning over charts, graphs, tables, and mathematical equations.
Improved Visual Reasoning and Layout Awareness: Trained on document visualization datasets for advanced spatial and semantic comprehension.
State-of-the-Art Performance Across Resolutions: Achieves top results on benchmarks such as DocVQA, InfographicVQA, MathVista, and RealWorldQA.
Extended Multimodal Duration Support: Handles long document sequences and extended videos (20+ minutes).
Final Release Stability: Consolidates all prior improvements for stable and reliable performance.

Quick Start with Transformers

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

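# Load the model; dtype and device placement are selected automatically.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Gliese-OCR-7B-Post2.0-final", torch_dtype="auto", device_map="auto"
)

# The processor handles chat templating and image preprocessing.
processor = AutoProcessor.from_pretrained("prithivMLmods/Gliese-OCR-7B-Post2.0-final")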

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg"},
            {"type": "text", "text": "Describe the document structure and extract key text content."},
        ],
    }
]

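# Render the chat prompt and collect the image/video inputs referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)  # follow the device chosen by device_map="auto" instead of hard-coding "cuda"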

# Generate, then drop the prompt tokens so only the newly generated text is decoded.
generated_ids = model.generate(**inputs, max_new_tokens=256)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]
output_text = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)
print(output_text)  # a list with one decoded string per input
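
The Quick Start passes a single image. For multi-page documents, one possible approach is to place several image entries in the same user turn. The sketch below only builds the `messages` structure. The helper name `build_page_messages` and the page file names are hypothetical, and it assumes the processor accepts several image entries per message in the same way as the single-image case.

```python
from typing import Dict, List


def build_page_messages(page_images: List[str], prompt: str) -> List[Dict]:
    """Build a single-turn chat message with one image entry per page.

    `page_images` may hold URLs or local paths; they are passed through unchanged.
    """
    if not page_images:
        raise ValueError("at least one page image is required")
    # One image entry per page, in reading order, followed by the instruction.
    content = [{"type": "image", "image": page} for page in page_images]
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


messages = build_page_messages(
    ["page_1.png", "page_2.png"],  # hypothetical file names
    "Extract the text of each page in reading order.",
)
```

The resulting `messages` list can then go through `processor.apply_chat_template` and `process_vision_info` as in the Quick Start. Each page adds visual tokens, so long documents may call for smaller images or a larger `max_new_tokens`.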

Intended Use

Document visualization and OCR extraction tasks.
Context-aware document retrieval and multimodal linking.
Extraction and LaTeX formatting of equations and structured content.
Analytical document interpretation (charts, tables, graphs, and figures).
Multilingual OCR for enterprise, academic, and research use cases.
Summarization, question answering, and cross-modal reasoning over long documents.
Intelligent robotic or mobile automation guided by visual document input.

Limitations

Reduced accuracy on heavily degraded or occluded documents.
High computational requirements for large-scale or real-time applications.
Limited optimization for low-resource or edge devices.
Occasional misalignment in text layout or minor hallucinations in outputs.
Performance may vary depending on visual token configuration and context length settings.
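
As a rough illustration of the last point, Qwen2-VL-style processors typically bound the number of visual tokens per image through `min_pixels` and `max_pixels`, with one visual token covering roughly a 28x28 pixel area. The sketch below estimates the token count for a given image size. The function names are hypothetical, and the resizing rule is an approximation of the upstream behaviour rather than this model's confirmed preprocessing.

```python
import math

PATCH = 28  # assumed pixel side covered by one visual token


def pixel_budget(min_tokens: int, max_tokens: int) -> tuple:
    """Convert a visual-token range into (min_pixels, max_pixels)."""
    return min_tokens * PATCH * PATCH, max_tokens * PATCH * PATCH


def estimate_visual_tokens(height: int, width: int, min_pixels: int, max_pixels: int) -> int:
    """Estimate visual tokens after resizing the image into the pixel budget."""
    # Snap both sides to a multiple of the patch size.
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)
    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt(height * width / max_pixels)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH
    return (h // PATCH) * (w // PATCH)


min_pixels, max_pixels = pixel_budget(256, 1280)
tokens = estimate_visual_tokens(1754, 1240, min_pixels, max_pixels)  # A4 scan at about 150 dpi
print(min_pixels, max_pixels, tokens)
```

In upstream Qwen2.5-VL examples these bounds are passed as `AutoProcessor.from_pretrained(..., min_pixels=min_pixels, max_pixels=max_pixels)`; whether the same values suit this checkpoint is an assumption worth verifying. Lowering `max_pixels` reduces memory use but may hurt recognition of small text.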

References

YaRN: Efficient Context Window Extension of Large Language Models
https://arxiv.org/pdf/2309.00071
Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution
https://arxiv.org/pdf/2409.12191
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
https://arxiv.org/pdf/2308.12966