hynky HF Staff committed on
Commit 080fbc6 · verified · 1 Parent(s): 9088e16

Add model card for afr_Latn classifier

Files changed (1): README.md (+168, -196)

---
language:
- af
license: apache-2.0
datasets:
- HuggingFaceFW/finepdfs_fw_edu_labeled
---

# FinePDFs-Edu classifier (afr_Latn)

## Model summary
This is a classifier for judging the educational value of web pages. It was developed to filter and curate educational content from web datasets and was trained on 176,221 [annotations](https://huggingface.co/datasets/HuggingFaceFW/finepdfs_fw_edu_labeled) generated by [Qwen3-235B-A22B-Instruct-2507](https://huggingface.co/Qwen/Qwen3-235B-A22B-Instruct-2507) for web samples from the [FinePDFs](https://huggingface.co/datasets/HuggingFaceFW/finepdfs) dataset.

We used this classifier to build the [FinePDFs-Edu](https://huggingface.co/datasets/HuggingFaceFW/finepdfs-edu) dataset.

### How to use in transformers
To load the FinePDFs-Edu classifier, use the following code:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import re

# Model context is 2048 tokens; reserve 2 positions for special tokens.
CHUNK_SIZE = 2048 - 2
MAX_CHARS = 10_000

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceFW/finepdfs_edu_classifier_afr_Latn")
model = AutoModelForSequenceClassification.from_pretrained("HuggingFaceFW/finepdfs_edu_classifier_afr_Latn")
regex_whitespace = re.compile(r'\s')

def create_text_chunks(text: str, tokenizer):
    def trim_to_whitespace(text: str, trim_start: bool = True, trim_end: bool = True):
        # Drop partial words at the chunk boundaries by trimming to the nearest whitespace.
        if trim_start:
            match = regex_whitespace.search(text)
            if match:
                text = text[match.start() + 1:]
            else:
                text = text[10:]
        if trim_end:
            match = regex_whitespace.search(text[::-1])
            if match:
                text = text[:len(text) - match.start() - 1]
            else:
                text = text[:-10]
        return text

    # Speed hack: tokenize at most MAX_CHARS characters from each end of the text.
    if len(text) <= 2 * MAX_CHARS:
        # Short document: a single chunk from the top is enough.
        tokens = tokenizer.encode(text[:MAX_CHARS], return_tensors="np", add_special_tokens=False)[0]
        chunks_from_top_sampled = [tokens[:CHUNK_SIZE]]
        chunks_top_text = tokenizer.batch_decode(chunks_from_top_sampled, skip_special_tokens=True)
        chunks_top_text = [trim_to_whitespace(chunks_top_text[0], trim_start=False, trim_end=True)]
        return chunks_top_text
    else:
        # Long document: tokenize the top and the bottom of the text.
        text_top = text[:MAX_CHARS]
        text_bottom = text[-MAX_CHARS:]
        tokens = tokenizer.batch_encode_plus([text_top, text_bottom], add_special_tokens=False)["input_ids"]
        # Slicing the bottom chunk from the end ensures it is always maxed out.
        chunks = [tokens[0][:CHUNK_SIZE], tokens[1][-CHUNK_SIZE:]]
        chunks_text = tokenizer.batch_decode(chunks, skip_special_tokens=True)
        chunks_top_text = [trim_to_whitespace(chunks_text[0], trim_start=False, trim_end=True)]
        chunks_bottom_text = [trim_to_whitespace(chunks_text[1], trim_start=True, trim_end=False)]
        return chunks_top_text + chunks_bottom_text

text = "This is a test sentence." * 2000
chunks = create_text_chunks(text, tokenizer)
scores = []
for chunk in chunks:
    inputs = tokenizer(chunk, return_tensors="pt", padding="longest", truncation=True)
    outputs = model(**inputs)
    logits = outputs.logits.squeeze(-1).float().detach().numpy()
    scores.append(logits.item())

# The document score is the maximum over its chunks.
print(max(scores))
```
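
The model emits a single regression score on the 0-5 educational-value scale. As a minimal follow-up sketch (the helper name and rounding scheme are our own, not part of the official pipeline), you could discretize the score and apply the curation threshold suggested in the Limitations section below:

```python
# Hypothetical post-processing, not from the official card.
def to_int_score(score: float) -> int:
    # Round the regression output to the nearest class on the 0-5 scale.
    return int(round(min(max(score, 0.0), 5.0)))

doc_score = max(scores)
print(to_int_score(doc_score))
# Keep the document for an educational corpus if it clears the threshold.
keep = doc_score >= 1.35  # threshold recommended in the Limitations section
```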

## Training
The classifier was trained on 302,343 pairs of web samples and their scores from 0 to 5, generated by Qwen3-235B-A22B-Instruct-2507. The samples were annotated based on their educational quality, with 0 being not educational and 5 being highly educational.
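
The annotations are hosted on the Hub; a minimal sketch for inspecting them (the split name is an assumption, check the dataset card):

```python
from datasets import load_dataset

# Assumed split name; verify against the dataset viewer.
ds = load_dataset("HuggingFaceFW/finepdfs_fw_edu_labeled", split="train")
print(ds)      # features and number of rows
print(ds[0])   # one annotated sample
```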

Below is the prompt used for Qwen3-235B-A22B-Instruct-2507 annotations:
```
Below is an extract from a PDF file. Evaluate whether the extract has a high educational
value and could be useful in an educational setting for teaching from primary school to
grade school levels using the additive 5-point scoring system described below. Points are
accumulated based on the satisfaction of each criterion:
- Add 1 point if the extract provides some basic information relevant to educational topics, even if it includes some irrelevant or non-academic content like advertisements and
promotional material.
- Add another point if the extract addresses certain elements pertinent to education but
does not align closely with educational standards. It might mix educational content with
non-educational material, offering a superficial overview of potentially useful topics, or
presenting information in a disorganized manner and incoherent writing style.
- Award a third point if the extract is appropriate for educational use and introduces key
concepts relevant to school curricula. It is coherent though it may not be comprehensive
or could include some extraneous information. It may resemble an introductory section of
a textbook or a basic tutorial that is suitable for learning but has notable limitations like
treating concepts that are too complex for grade school students.
- Grant a fourth point if the extract is highly relevant and beneficial for educational purposes
for a level not higher than grade school, exhibiting a clear and consistent writing style. It
could be similar to a chapter from a textbook or a tutorial, offering substantial educational
content, including exercises and solutions, with minimal irrelevant information, and the
concepts aren’t too advanced for grade school students. The content is coherent, focused,
and valuable for structured learning.
- Bestow a fifth point if the extract is outstanding in its educational value, perfectly suited for
teaching either at primary school or grade school. It follows detailed reasoning, the writing
style is easy to follow and offers profound and thorough insights into the subject matter,
devoid of any non-educational or complex content.
The extract: {example}.
After examining the extract:
- Briefly justify your total score, up to 100 words.
- Conclude with the score using the format: "Educational score: <total points>"
```
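
The prompt pins down the format of the final line, so the label can be recovered from a model response with a simple regex; a minimal sketch (the response text is a made-up example):

```python
import re

# Hypothetical model response; only the final line's format is guaranteed by the prompt.
response = "The extract is a coherent tutorial with exercises. Educational score: 4"
match = re.search(r"Educational score:\s*(\d+)", response)
score = int(match.group(1)) if match else None
print(score)  # 4
```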

We added a classification head with a single regression output to [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base), unfroze the last 4 layers, and trained the model for 5,000 steps with a learning rate of 3e-4; a minimal sketch of this setup follows the list below.

**Training Details:**

- Model: [jhu-clsp/mmBERT-base](https://huggingface.co/jhu-clsp/mmBERT-base) with a classification head
- Dataset: 302,343 samples from Qwen3-235B-A22B-Instruct-2507 annotations
- Steps: 5,000
- Learning Rate: 3e-4
- Class distribution: {0: 35960, 1: 122543, 2: 35960, 3: 35960, 4: 35960, 5: 35960}
- Evaluation Metric: F1 score
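
As a rough illustration (a minimal sketch, not the actual training script; the module paths follow the ModernBERT architecture used by mmBERT-base and should be verified on the loaded model), the partial-freezing setup might look like:

```python
import torch
from transformers import AutoModelForSequenceClassification

# Single regression output, as described above.
model = AutoModelForSequenceClassification.from_pretrained(
    "jhu-clsp/mmBERT-base", num_labels=1, problem_type="regression"
)

# Freeze the whole model first...
for param in model.parameters():
    param.requires_grad = False
# ...then unfreeze the classification head and the last 4 encoder layers
# (model.model.layers is the ModernBERT layer stack; inspect the model to confirm).
for module in [model.classifier, *model.model.layers[-4:]]:
    for param in module.parameters():
        param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=3e-4
)
```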

**Classification report**

We treat the regression model's predictions as discrete classes to calculate the metrics on a hold-out set of 7,047 Qwen3-235B-A22B-Instruct-2507-annotated samples.
```
Validation Report:
| class | precision | recall | f1-score | support |
|------:|----------:|-------:|---------:|--------:|
|     0 |      0.62 |   0.65 |     0.63 |    1003 |
|     1 |      0.88 |   0.85 |     0.86 |    4902 |
|     2 |      0.41 |   0.48 |     0.44 |     608 |
|     3 |      0.42 |   0.45 |     0.44 |     273 |
|     4 |      0.67 |   0.62 |     0.65 |     225 |
|     5 |      0.55 |   0.33 |     0.41 |      36 |
```
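
The discretization behind this report and the confusion matrix below can be reproduced with a sketch like the following (variable names are hypothetical: `y_true` holds the Qwen3 labels, `preds` the raw regression outputs):

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Toy stand-ins for the hold-out annotations and model outputs.
y_true = np.array([0, 1, 1, 5])
preds = np.array([0.2, 1.1, 0.8, 4.6])

# Round each regression output to the nearest class on the 0-5 scale.
y_pred = np.clip(np.round(preds), 0, 5).astype(int)
print(classification_report(y_true, y_pred))
print(confusion_matrix(y_true, y_pred, labels=range(6)))
```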

**Confusion matrix**

We verify that the predicted educational scores are indeed close to their ground truth, and that the errors are mostly caused by the noisy annotations.
```
Confusion Matrix:
| class |   0 |    1 |   2 |   3 |   4 |   5 |
|------:|----:|-----:|----:|----:|----:|----:|
|     0 | 653 |  348 |   2 |   0 |   0 |   0 |
|     1 | 405 | 4165 | 310 |  21 |   1 |   0 |
|     2 |   0 |  223 | 289 |  88 |   8 |   0 |
|     3 |   0 |   19 |  90 | 124 |  40 |   0 |
|     4 |   0 |    4 |  13 |  58 | 140 |  10 |
|     5 |   0 |    0 |   1 |   3 |  20 |  12 |
```

## Limitations
While the FinePDFs-Edu classifier performs well in distinguishing high-quality educational content for the FinePDFs dataset, there are some limitations:

- Scope: The model's performance might change on other datasets, in particular on out-of-distribution samples. It is also focused on educational content relevant to primary and grade school levels and may not perform as well on content intended for higher education or specialized domains.
- Bias: The model's performance depends on the quality and representativeness of the training data and of the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic-looking content for the higher scores, and we recommend using int_score >= 1.35 (top 10% for English) as a threshold for data curation.
- Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.

The training and inference code is available on GitHub:
https://github.com/huggingface/finepdfs/tree/main/classification