A newer version of this model is available: sherif1313/Arabic-handwritten-OCR-4bit-Qwen2.5-VL-3B-v2

🕌 Arabic-handwritten-OCR-4bit-Qwen2.5-VL-3B-v1

Multimodal Model for Arabic Text Extraction from Images and Historical Documents

🎯 Overview

Arabic-handwritten-OCR-4bit-Qwen2.5-VL-3B-v1 is a multimodal (vision + language) model built on Qwen2.5-VL-3B-Instruct and trained on specialized Arabic datasets to extract Arabic text from images and documents.

✨ Features

  • 🖼️ Arabic Text Extraction: Reads Arabic text from images and documents
  • 💬 Bilingual Processing: Supports both Arabic and English
  • 🎯 Instruction Tuning: Trained on diverse instructions for text extraction tasks
  • 🚀 High Efficiency: 4-bit quantization reduces memory use

📊 Technical Specifications

Feature	Value
Base Model	Qwen/Qwen2.5-VL-3B-Instruct
Parameters	3 billion
Quantization	4-bit
Supported Languages	Arabic, English
Model Type	Multimodal (Image + Text)
License	Apache-2.0

🔧 Installation & Usage

from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
import torch
from PIL import Image
from typing import List

def process_vision_info(messages: List[dict]):
    """Collect image and video inputs from chat messages; image paths are opened as RGB."""
    image_inputs = []
    video_inputs = []
    for message in messages:
        if isinstance(message["content"], list):
            for item in message["content"]:
                if item["type"] == "image":
                    image = item["image"]
                    if isinstance(image, str):
                        image = Image.open(image).convert("RGB")
                    elif isinstance(image, Image.Image):
                        pass
                    else:
                        raise ValueError(f"Unsupported image type: {type(image)}")
                    image_inputs.append(image)
                elif item["type"] == "video":
                    video_inputs.append(item["video"])
    return image_inputs if image_inputs else None, video_inputs if video_inputs else None

# Load model and processor
model_name = "sherif1313/Arabic-handwritten-OCR-4bit-Qwen2.5-VL-3B-v1"
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.float16,
    device_map="auto"
)

# Setup processor
min_pixels = 512 * 28 * 28 
max_pixels = 2048 * 28 * 28
processor = AutoProcessor.from_pretrained(
    model_name, 
    min_pixels=min_pixels, 
    max_pixels=max_pixels
)

def extract_text_from_image(image_path):
    try:
        messages = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image_path},
                    {"type": "text", "text": "Read all texts in this image and extract them as they are. Don't miss any word."},
                ],
            }
        ]

        text = processor.apply_chat_template(
            messages, tokenize=False, add_generation_prompt=True
        )
        image_inputs, video_inputs = process_vision_info(messages)
        
        inputs = processor(
            text=[text],
            images=image_inputs,
            videos=video_inputs,
            padding=True,
            return_tensors="pt",
        ).to(model.device)

        generated_ids = model.generate(
            **inputs,
            max_new_tokens=1000,
            do_sample=False,
            pad_token_id=processor.tokenizer.pad_token_id
        )

        input_len = inputs.input_ids.shape[1]
        output_text = processor.batch_decode(
            generated_ids[:, input_len:],
            skip_special_tokens=True,
            clean_up_tokenization_spaces=False
        )[0]

        return output_text.strip()

    except Exception as e:
        return f"Error processing image: {e}"

# Usage example
if __name__ == "__main__":
    image_path = "path/to/your/image.jpg"  # Replace with your image path
    extracted_text = extract_text_from_image(image_path)
    print("Extracted Text:")
    print(extracted_text)
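
The `min_pixels` and `max_pixels` values in the script above are written as multiples of `28 * 28`, which suggests that each 28x28 pixel block corresponds to one visual token, giving a budget of roughly 512 to 2048 visual tokens per image. The sketch below estimates the resized dimensions and token count for a given image size under that assumption. The function name `estimate_visual_tokens` and the constant `PATCH` are hypothetical, and the rounding rules are an approximation for illustration, not the processor's actual code.

```python
import math

PATCH = 28  # assumed: one visual token covers a 28x28 pixel block


def estimate_visual_tokens(width, height,
                           min_pixels=512 * 28 * 28,
                           max_pixels=2048 * 28 * 28):
    """Estimate the resized (width, height) and visual token count.

    Illustrative approximation only: snaps each side to a multiple of
    PATCH, then rescales so the pixel area fits [min_pixels, max_pixels].
    """
    h = max(PATCH, round(height / PATCH) * PATCH)
    w = max(PATCH, round(width / PATCH) * PATCH)

    if h * w > max_pixels:
        # Too large: shrink both sides by the same factor, rounding down.
        scale = math.sqrt((height * width) / max_pixels)
        h = math.floor(height / scale / PATCH) * PATCH
        w = math.floor(width / scale / PATCH) * PATCH
    elif h * w < min_pixels:
        # Too small: enlarge both sides by the same factor, rounding up.
        scale = math.sqrt(min_pixels / (height * width))
        h = math.ceil(height * scale / PATCH) * PATCH
        w = math.ceil(width * scale / PATCH) * PATCH

    tokens = (h * w) // (PATCH * PATCH)
    return (w, h), tokens


if __name__ == "__main__":
    for size in [(280, 280), (1000, 800), (2800, 2800)]:
        resized, tokens = estimate_visual_tokens(*size)
        print(size, "->", resized, tokens, "visual tokens")
```

Raising `max_pixels` lets large scans keep more detail at the cost of more memory and slower generation; lowering it does the reverse. For dense manuscript pages, cropping to individual lines or paragraphs before OCR may help more than raising the limit.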







🏋️ Training Data
Data Sources:

    Muharaf Public Dataset     https://huggingface.co/datasets/aamijar/muharaf-public

    Arabic OCR Images          https://huggingface.co/datasets/saleh-c4/arabic-ocr-images

    KHATT Arabic Dataset       https://gts.ai/dataset-download/khatt-arabic-dataset/



    Additional historical manuscripts and documents

Training Statistics:
Item	Value
Training Samples	60,880+
Epochs	3
Learning Rate	2e-5
Batch Size	40
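
As a rough sanity check on these figures, the number of optimizer steps can be estimated from the sample count, batch size, and epoch count. This sketch assumes that the listed batch size is the effective (global) batch size and that no samples were dropped; neither is stated in the card, and 60,880 is a lower bound on the sample count.

```python
import math

train_samples = 60_880  # lower bound; the card lists "60,880+"
batch_size = 40         # assumed to be the effective global batch size
epochs = 3

steps_per_epoch = math.ceil(train_samples / batch_size)
total_steps = steps_per_epoch * epochs

print(f"steps per epoch: {steps_per_epoch}")
print(f"total steps:     {total_steps}")
```

Under these assumptions, training ran for at least 1,522 steps per epoch, or about 4,566 steps in total.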

📊 Performance
Test Results on Various Documents:
Task	Accuracy	Description
Text Extraction from Documents	77.63%	Arabic texts from historical documents
Best Performance	96.88%	On clear and simple texts
Worst Performance	56.00%	On complex texts or difficult fonts
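
The card does not state how these accuracy figures were computed. One common choice for OCR evaluation is character-level accuracy, defined as one minus the character error rate (CER), where CER is the edit distance between the reference and the predicted text divided by the reference length. The sketch below shows that metric as an assumption, not as the author's evaluation method; `edit_distance` and `char_accuracy` are hypothetical helper names.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def char_accuracy(reference: str, hypothesis: str) -> float:
    """Character-level accuracy: 1 - CER, clipped to the range [0, 1]."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    cer = edit_distance(reference, hypothesis) / len(reference)
    return max(0.0, 1.0 - cer)


if __name__ == "__main__":
    reference = "بسم الله الرحمن الرحيم"
    hypothesis = "بسم الله الرحمن الرحم"  # one character missing
    print(f"{char_accuracy(reference, hypothesis):.2%}")
```

Averaging this score over a held-out set of page or line images would give a single figure comparable in spirit to the averages above, though the results will differ if the original evaluation used word-level or exact-match scoring.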

🚀 We Launched the Beta Version!
What We Offer:

    First open-source Arabic OCR model for historical documents

    Average accuracy 77.63% - suitable for archiving and exploration

    Completely free to use

    4-bit quantized model for high efficiency

How You Can Help:

    Use the model and give us feedback

    Send us challenging examples you encounter

    Help improve training data

Coming Soon:

    Version 2.0 with target accuracy 85%+

    Support for more Arabic fonts

    User-friendly web interface

⚠️ Limitations & Warnings

    📷 Image Quality: Performance depends on input image quality and clarity

    🖋️ Handwriting: May struggle with irregular handwriting

    🔞 Content: Must be used for legal and ethical purposes only

    🌐 Dialects: Primarily trained on Standard Arabic

🛡️ Ethical Responsibility

This model should be used responsibly considering:

    Respect for privacy and copyright

    Avoidance of fraudulent purposes

    Compliance with local and international laws

    Verification of results in sensitive applications

📄 Citation

If you use this model in your research, please cite as follows:



@misc{qwen25vlarabicocr,
  title={Arabic-handwritten-OCR-4bit-Qwen2.5-VL-3B-v1: Arabic Text Extraction Model},
  author={Sherif1313},
  year={2025},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/sherif1313/Arabic-handwritten-OCR-4bit-Qwen2.5-VL-3B-v1}}
}