---
language:
- vi
license: apache-2.0
datasets:
- HuggingFaceFW/finepdfs_fw_edu_labeled
---

# FinePDFs-Edu classifier (vie_Latn)

## Model summary
This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets and was trained on 359,067 [annotations](https://huggingface.co/datasets/HuggingFaceFW/finepdfs_fw_edu_labeled) generated by [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) for web samples from the [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) dataset.

We used this classifier to build the [FinePDFs-Edu](https://huggingface.co/datasets/HuggingFaceFW/finepdfs-edu) dataset.
### How to use in transformers
To load the FinePDFs-Edu classifier, use the following code:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import re
CHUNK_SIZE = 2048 - 2
MAX_CHARS = 10_000

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/finepdfs_edu_classifier_vie_Latn")
model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/finepdfs_edu_classifier_vie_Latn")
regex_whitespace = re.compile(r'\s')

def create_text_chunks(text: str, tokenizer):
    def trim_to_whitespace(text: str, trim_start: bool = True, trim_end: bool = True):
        if trim_start:
            match = regex_whitespace.search(text)
            if match:
                text = text[match.start()+1:]
            else:
                text = text[10:]
        if trim_end:
            match = regex_whitespace.search(text[::-1])
            if match:
                text = text[:len(text) - match.start() - 1]
            else:
                text = text[:-10]
        return text

    # Speed hack: for short documents, tokenize only the first MAX_CHARS characters
    if len(text) <= 2*MAX_CHARS:
        tokens = tokenizer.encode(text[:MAX_CHARS], return_tensors="np", add_special_tokens=False)[0]
        # Process the top chunks
        chunks_from_top_sampled = [tokens[:CHUNK_SIZE]]

        chunks_top_text = tokenizer.batch_decode(chunks_from_top_sampled, skip_special_tokens=True)

        chunks_top_text = [trim_to_whitespace(chunks_top_text[0], trim_start=False, trim_end=True)]
        return chunks_top_text

    else:
        # We tokenize the top and bottom of text
        text_top = text[:MAX_CHARS]
        text_bottom = text[-MAX_CHARS:]

        tokens = tokenizer.batch_encode_plus([text_top, text_bottom], return_tensors="np", add_special_tokens=False)["input_ids"]

        # This ensures that the second chunk is always maxed out
        chunks = [tokens[0][:CHUNK_SIZE], tokens[1][-CHUNK_SIZE:]]

        chunks_text = tokenizer.batch_decode(chunks, skip_special_tokens=True)
        chunks_top_text = [trim_to_whitespace(chunks_text[0], trim_start=False, trim_end=True)]
        chunks_bottom_text = [trim_to_whitespace(chunks_text[1], trim_start=True, trim_end=False)]
        return chunks_top_text + chunks_bottom_text

text = "This is a test sentence." * 2000
chunks = create_text_chunks(text, tokenizer)
scores = []
for chunk in chunks:
    inputs = tokenizer(chunk, return_tensors="pt", padding="longest", truncation=True)
    outputs = model(**inputs)
    logits = outputs.logits.squeeze(-1).float().detach().numpy()
    score = logits.item()
    scores.append(score)

print(max(scores))
```

## Training
The classifier was trained on 251,520 pairs of web samples and their scores from 0 to 5, generated by Qwen3-235B-A22B-Instruct-2507. The samples were annotated for educational quality, with 0 being not educational and 5 being highly educational.

Below is the prompt used for Qwen3-235B-A22B-Instruct-2507 annotations:
```
Below is an extract from a PDF file. Evaluate whether the extract has a high educational
value and could be useful in an educational setting for teaching from primary school to
grade school levels using the additive 5-point scoring system described below. Points are
accumulated based on the satisfaction of each criterion:
- Add 1 point if the extract provides some basic information relevant to educational topics, even if it includes some irrelevant or non-academic content like advertisements and
promotional material.
- Add another point if the extract addresses certain elements pertinent to education but
does not align closely with educational standards. It might mix educational content with
non-educational material, offering a superficial overview of potentially useful topics, or
presenting information in a disorganized manner and incoherent writing style.
- Award a third point if the extract is appropriate for educational use and introduces key
concepts relevant to school curricula. It is coherent though it may not be comprehensive
or could include some extraneous information. It may resemble an introductory section of
a textbook or a basic tutorial that is suitable for learning but has notable limitations like
treating concepts that are too complex for grade school students.
- Grant a fourth point if the extract highly relevant and beneficial for educational purposes
for a level not higher than grade school, exhibiting a clear and consistent writing style. It
could be similar to a chapter from a textbook or a tutorial, offering substantial educational
content, including exercises and solutions, with minimal irrelevant information, and the
concepts aren’t too advanced for grade school students. The content is coherent, focused,
and valuable for structured learning.
- Bestow a fifth point if the extract is outstanding in its educational value, perfectly suited for
teaching either at primary school or grade school. It follows detailed reasoning, the writing
style is easy to follow and offers profound and thorough insights into the subject matter,
devoid of any non-educational or complex content.
The extract: {example}.
After examining the extract:
- Briefly justify your total score, up to 100 words.
- Conclude with the score using the format: "Educational score: <total points>"\
```

We added a classification head with a single regression output to [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base), unfroze the last 4 layers, and trained the model for 5,000 steps with a learning rate of 3e-4.

**Training Details:**

- Model: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) with a classification head
- Dataset: 251,520 samples from Qwen3-235B-A22B-Instruct-2507 annotations
- Steps: 5,000
- Learning Rate: 3e-4
- Class distribution: {0: 104800, 1: 104800, 2: 10480, 3: 10480, 4: 10480, 5: 10480}
- Evaluation Metric: F1 score
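
The layer-freezing step above can be sketched generically. `freeze_all_but_last` below is a hypothetical helper, demonstrated on a stand-in `torch.nn` encoder rather than on mmBERT itself:

```python
import torch.nn as nn

def freeze_all_but_last(layers: nn.ModuleList, head: nn.Module, n: int = 4) -> None:
    # Freeze every encoder layer, then re-enable gradients for the
    # last n layers and the regression head.
    for layer in layers:
        for p in layer.parameters():
            p.requires_grad = False
    for layer in layers[-n:]:
        for p in layer.parameters():
            p.requires_grad = True
    for p in head.parameters():
        p.requires_grad = True

# Stand-in encoder: 12 layers plus a single-output regression head.
encoder = nn.ModuleList([nn.Linear(16, 16) for _ in range(12)])
head = nn.Linear(16, 1)
freeze_all_but_last(encoder, head)

# Each nn.Linear has 2 parameter tensors (weight, bias), so 4 unfrozen
# layers contribute 8 trainable tensors.
trainable = sum(p.requires_grad for layer in encoder for p in layer.parameters())
print(trainable)  # 8
```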

**Classification report**

We treat the regression model's predictions as discrete classes to compute the metrics on a hold-out set of 14,362 Qwen3-235B-A22B-Instruct-2507-annotated samples.
```
Validation Report:
|   class |   precision |   recall |   f1-score |   support |
|--------:|------------:|---------:|-----------:|----------:|
|       0 |        0.77 |     0.88 |       0.82 |      7281 |
|       1 |        0.8  |     0.65 |       0.72 |      6480 |
|       2 |        0.3  |     0.36 |       0.33 |       337 |
|       3 |        0.33 |     0.54 |       0.41 |       132 |
|       4 |        0.58 |     0.53 |       0.55 |       121 |
|       5 |        0.3  |     0.27 |       0.29 |        11 |
```
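
The mapping from the head's continuous output to the discrete classes in this report can be sketched as rounding and clipping to the 0–5 range (the exact bucketing used for the report is an assumption):

```python
def to_class(score: float, low: int = 0, high: int = 5) -> int:
    # Round the regression output to the nearest integer and clip it
    # to the valid score range.
    return max(low, min(high, round(score)))

print(to_class(1.4))   # 1
print(to_class(4.7))   # 5
print(to_class(-0.3))  # 0
print(to_class(6.2))   # 5
```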

**Confusion matrix**

We verify that the predicted educational scores are close to the ground truth, and that most errors stem from noisy annotations.
```
Confusion Matrix:
|   class  |    0 |    1 |   2 |   3 |   4 |   5 |
|---------:|-----:|-----:|----:|----:|----:|----:|
|        0 | 6412 |  865 |   4 |   0 |   0 |   0 |
|        1 | 1963 | 4224 | 242 |  49 |   2 |   0 |
|        2 |    2 |  146 | 121 |  57 |  11 |   0 |
|        3 |    0 |   10 |  25 |  71 |  26 |   0 |
|        4 |    0 |    4 |   8 |  38 |  64 |   7 |
|        5 |    0 |    0 |   1 |   0 |   7 |   3 |
```


## Limitations
While the FinePDFs-Edu classifier performs well in distinguishing high-quality educational content in the FinePDFs dataset, there are some limitations:

- Scope: The model's performance might change on other datasets, in particular on out-of-distribution samples. It is also focused on educational content relevant to primary and grade school levels and may not perform as well on content intended for higher education or specialized domains.
- Bias: The model's judgments depend on the quality and representativeness of the training data and of the LLM used for annotation; biases in either can affect the classifier's scores. It might overfit to academic-looking content at the higher scores, and we recommend using int_score >= 1.35 (top 10% for English) as a threshold for data curation.
- Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
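
Applied to data curation, the recommended threshold amounts to a simple filter over per-document scores (the field names here are illustrative):

```python
THRESHOLD = 1.35  # roughly the top 10% of scores for English

def curate(docs: list[dict], threshold: float = THRESHOLD) -> list[dict]:
    # Keep only documents whose classifier score clears the threshold.
    return [d for d in docs if d["score"] >= threshold]

docs = [
    {"id": "a", "score": 0.4},
    {"id": "b", "score": 1.2},
    {"id": "c", "score": 1.5},
    {"id": "d", "score": 3.8},
]
kept = curate(docs)
print([d["id"] for d in kept])  # ['c', 'd']
```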

The training and inference code is available on [GitHub](https://github.com/huggingface/finepdfs/tree/main/classification).