hynky HF Staff committed on
Commit 078f896 · verified · 1 Parent(s): c793408

Add model card for hun_Latn classifier

Files changed (1)
  1. README.md +2 -1
README.md CHANGED
@@ -1,3 +1,4 @@
+
 ---
 language:
 - hu
@@ -163,7 +164,7 @@ Confusion Matrix:
 While the FinePDFs-Edu classifier performs well in distinguishing high-quality educational content for FinePDFs dataset, there are some limitations:
 
 - Scope: The model's performance might change for other datasets, in particular for out of distribution samples. It is also focused on educational content relevant to primary and grade school levels and may not perform as well on content intended for higher education or specialized domains.
-- Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic looking content for the higher scores and we recommend using int_score >= 3 as a threshold for data curation.
+- Bias: The model's performance is dependent on the quality and representativeness of the training data and the LLM used for the annotation. Biases in both can affect the classifier's judgments. It might overfit to academic looking content for the higher scores and we recommend using int_score >= 1.35 (top 10% for english) as a threshold for data curation.
 - Context: The classifier evaluates individual web pages or extracts without considering broader context, which might impact its effectiveness in certain scenarios.
 
 The training and inference code is available on GitHub
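
The curation threshold named in the updated Bias note can be applied as a simple filter. The following is a minimal sketch, not the repository's own code: it assumes each record is a plain dict that carries the classifier output under an `int_score` key, and the helper name `filter_by_score` is hypothetical. Only the threshold value 1.35 comes from the commit.

```python
from typing import Any, Dict, Iterable, Iterator

# Threshold taken from the updated model card text in this commit.
# The record layout (dicts with an "int_score" key) is an assumption.
THRESHOLD = 1.35


def filter_by_score(
    records: Iterable[Dict[str, Any]],
    key: str = "int_score",
    threshold: float = THRESHOLD,
) -> Iterator[Dict[str, Any]]:
    """Yield only the records whose score is at or above the threshold."""
    for record in records:
        score = record.get(key)
        # Records without a score are dropped rather than guessed at.
        if score is not None and score >= threshold:
            yield record


docs = [
    {"id": "a", "int_score": 0},
    {"id": "b", "int_score": 2},
    {"id": "c", "int_score": 1.35},
    {"id": "d"},  # no score attached
]
kept = [d["id"] for d in filter_by_score(docs)]
```

Passing `key` and `threshold` as parameters keeps the sketch usable if the score is stored under a different field or a different cutoff is preferred for another language.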