# FinePDFs-OCR-Quality classifier (English)

## Model summary

This is a classifier for judging the OCR extraction quality of documents. It was developed to filter and curate well-extracted content and was trained on 1,304,547 annotations generated by Qwen3-235B-A22B-Instruct-2507 for samples from the FinePDFs dataset.

## How to use in transformers

To load the FinePDFs-OCR-Quality classifier, use the following code:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re

CHUNK_SIZE = 2048 - 2  # reserve room for the two special tokens
MAX_CHARS = 10_000

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn")
model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/finepdfs_ocr_quality_classifier_eng_Latn")
regex_whitespace = re.compile(r'\s')

def create_text_chunks(text: str, tokenizer):
    def trim_to_whitespace(text: str, trim_start: bool = True, trim_end: bool = True):
        if trim_start:
            match = regex_whitespace.search(text)
            if match:
                text = text[match.start()+1:]
            else:
                text = text[10:]
        if trim_end:
            match = regex_whitespace.search(text[::-1])
            if match:
                text = text[:len(text) - match.start() - 1]
            else:
                text = text[:-10]
        return text

    # First tokenize the text.
    # Speed hack: we only consider at most MAX_CHARS characters from each end.
    if len(text) <= 2*MAX_CHARS:
        tokens = tokenizer.encode(text[:MAX_CHARS], return_tensors="np", add_special_tokens=False)[0]
        # Process the top chunks
        chunks_from_top_sampled = [tokens[:CHUNK_SIZE]]

        chunks_top_text = tokenizer.batch_decode(chunks_from_top_sampled, skip_special_tokens=True)

        chunks_top_text = [trim_to_whitespace(chunks_top_text[0], trim_start=False, trim_end=True)]
        return chunks_top_text

    else:
        # We tokenize the top and bottom of text
        text_top = text[:MAX_CHARS]
        text_bottom = text[-MAX_CHARS:]

        tokens = tokenizer.batch_encode_plus([text_top, text_bottom], return_tensors="np", add_special_tokens=False)["input_ids"]

        # This ensures that the second chunk is always maxed out
        chunks = [tokens[0][:CHUNK_SIZE], tokens[1][-CHUNK_SIZE:]]

        chunks_text = tokenizer.batch_decode(chunks, skip_special_tokens=True)
        chunks_top_text = [trim_to_whitespace(chunks_text[0], trim_start=False, trim_end=True)]
        chunks_bottom_text = [trim_to_whitespace(chunks_text[1], trim_start=True, trim_end=False)]
        return chunks_top_text + chunks_bottom_text

text = "This is a test sentence." * 2000
chunks = create_text_chunks(text, tokenizer)
scores = []
for chunk in chunks:
    inputs = tokenizer(chunk, return_tensors="pt", padding="longest", truncation=True)
    outputs = model(**inputs)
    logits = outputs.logits.squeeze(-1).float().detach().numpy()
    score = logits.item()
    scores.append(score)

print(max(scores))
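Each chunk yields a single regression score on the 0–3 annotation scale described in the Training section below, with higher values indicating cleaner extraction; the snippet above reports the maximum over the top and bottom chunks.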

## Training

The classifier was trained on 1,304,547 pairs of text samples and their scores from 0 to 3, generated by Qwen3-235B-A22B-Instruct-2507. The samples were annotated based on their extraction quality, with 0 meaning garbage text is present and 3 meaning a clean extraction.

Below is the prompt used for Qwen3-235B-A22B-Instruct-2507 annotations:

Below is an extract from a PDF file. Evaluate the quality of the document extraction using the 4-point scoring system described below. Select the single score that best represents the extraction quality level:

**Score 0: Garbage Text Present**
- Award 0 points if there are any garbage artifacts present in the text, regardless of how much legitimate content surrounds them. This includes OCR corruption like random character sequences (e.g., "7*/3./ +*/ 6- 4603"), unreadable symbol combinations, corrupted encoding artifacts, or any form of garbled text that renders portions of the document incomprehensible. Even if 90% of the text is perfectly readable, the presence of any garbage characters results in a score of 0.

**Score 1: Clear Formatting Issues**
- Award 1 point if there are no garbage characters but clear formatting problems are present. This includes broken mathematical equations or formulas that are unreadable, excessive or irregular spacing that disrupts readability, malformed tables or lists, severely corrupted line breaks, or other structural formatting issues that significantly impact the document's usability while keeping the text itself readable.

**Score 2: Minor Formatting Problems**
- Award 2 points if there are no garbage characters but minor formatting issues exist. This includes scattered extra spaces within words or sentences (e.g., "A t t h e S how"), inconsistent spacing, minor alignment issues, occasional broken line formatting, or small structural problems that don't severely impact readability but indicate imperfect extraction quality.

**Score 3: Clean Extraction**
- Award 3 points if there are no OCR garbage artifacts, no significant formatting issues, and the text extraction preserves the document's structure and readability effectively. The content should be clean, properly formatted, and easily readable with minimal to no extraction artifacts.

## Evaluation Process
The extract: {example}

After examining the extract:
- Briefly justify your score, focusing specifically on the presence of garbage text, formatting issues, and overall extraction quality, up to 100 words.
- Conclude with the score using the format: "Document extraction score: <total points>"
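Each completion ends with that fixed score line, which is parsed back into an integer label for training. A minimal sketch of such a parser (`parse_annotation_score` is a hypothetical helper, not from the original pipeline):

```python
import re

# Hypothetical helper: pull the final "Document extraction score: N"
# line out of an annotator completion and validate the label range.
SCORE_RE = re.compile(r"Document extraction score:\s*(\d)")

def parse_annotation_score(completion: str) -> int | None:
    match = SCORE_RE.search(completion)
    if match is None:
        return None  # malformed completion; drop the sample
    score = int(match.group(1))
    return score if 0 <= score <= 3 else None

assert parse_annotation_score("... Document extraction score: 2") == 2
```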

We added a classification head with a single regression output to answerdotai/ModernBERT-large, unfroze the last 4 layers and trained the model for 5000 steps with a learning rate of 3e-4.

Training Details:

  • Model: answerdotai/ModernBERT-large with a classification head
  • Dataset: 11,153,120 samples from Qwen3-235B-A22B-Instruct-2507 annotations
  • Steps: 5000
  • Learning Rate: 3e-4
  • Class distribution: balanced, 2,788,280 samples per class (0–3)
  • Evaluation Metric: F1 score
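A minimal sketch of this setup, assuming the standard `transformers` sequence-classification head with `num_labels=1` and that "last 4 layers" refers to the final encoder layers (attribute names follow the current ModernBERT implementation and may differ):

```python
import torch
from transformers import AutoModelForSequenceClassification

# Regression head: a single output trained with MSE loss.
model = AutoModelForSequenceClassification.from_pretrained(
    "answerdotai/ModernBERT-large", num_labels=1, problem_type="regression"
)

# Freeze the backbone, then unfreeze the last 4 encoder layers and the head.
for param in model.parameters():
    param.requires_grad = False
for layer in model.model.layers[-4:]:  # assumes ModernBERT's `model.layers`
    for param in layer.parameters():
        param.requires_grad = True
for module in (model.head, model.classifier):
    for param in module.parameters():
        param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4
)
```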

## Classification report

We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 10,000 Qwen3-235B-A22B-Instruct-2507-annotated samples.
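A minimal sketch of this discretization, assuming scores are rounded to the nearest integer and clipped to the 0–3 label range (the exact rule is not stated in this card):

```python
import numpy as np
from sklearn.metrics import classification_report

def discretize(scores: np.ndarray) -> np.ndarray:
    # Round the regression outputs and clip to the valid label range.
    return np.clip(np.round(scores), 0, 3).astype(int)

# Toy values for illustration only.
y_true = np.array([0, 1, 2, 3, 2])
y_pred = discretize(np.array([-0.2, 1.6, 2.4, 3.3, 1.9]))
print(classification_report(y_true, y_pred))
```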

Validation Report:
|   class |   precision |   recall |   f1-score |   support |
|--------:|------------:|---------:|-----------:|----------:|
|       0 |        0.62 |     0.61 |       0.62 |      1060 |
|       1 |        0.46 |     0.62 |       0.53 |      2649 |
|       2 |        0.58 |     0.61 |       0.59 |      8385 |
|       3 |        0.72 |     0.59 |       0.65 |      7906 |

## Confusion matrix

We verify that the predicted OCR quality scores are close to their ground truth: most errors fall in adjacent classes and are largely attributable to noisy annotations.

Confusion Matrix (rows: true class, columns: predicted class):
| true \ pred |   0 |    1 |    2 |    3 |
|------------:|----:|-----:|-----:|-----:|
|        0 | 650 |  283 |  108 |   19 |
|        1 | 236 | 1645 |  703 |   65 |
|        2 | 135 | 1412 | 5101 | 1737 |
|        3 |  20 |  236 | 2955 | 4695 |

## Limitations

While the FinePDFs-OCR-Quality classifier performs well at distinguishing high-quality PDF extractions in the FinePDFs dataset, it has some limitations:

  • Scope: The model evaluates OCR quality using the recognized text only. Its behavior can vary across languages, scripts, and formatting (tables, math, mixed inline code). It is tuned on common, printed materials and may be less reliable on handwriting-heavy documents, highly technical notation, or unconventional orthography.
  • Bias: Performance depends on the representativeness of the text produced by the OCR pipeline and the data used to train/annotate the classifier. If training skewed toward clean, Latin-script outputs or specific OCR engines, the classifier may systematically favor those and under-score text from other scripts, noisy sources, or different OCR models.
  • Context: The classifier scores individual pages/snippets of post-OCR text without access to the original images, layout, or broader document context. It does not model downstream usage (e.g., NER, search, or translation) and cannot recover layout fidelity, tables, or figures lost during OCR.

Thresholds / Recommendation: In our evaluations, applying classifier-score filtering provided no measurable downstream performance benefit, so we do not recommend using any score threshold for curation or routing. The training and inference code is available on GitHub: https://github.com/huggingface/finepdfs/tree/main/classification
