Madras1 committed on
Commit 9f47e3d · verified · 1 Parent(s): bdb95af

Update README.md

Files changed (1)
  1. README.md +22 -1
README.md CHANGED
@@ -44,7 +44,28 @@ The model outputs the following labels:
  * `LABEL_0`: **Non-Biology** (General text, News, Finance, Sports, etc.)
  * `LABEL_1`: **Biology** (Genetics, Medicine, Anatomy, Ecology, etc.)

- ## How to Use 🚀
+ ## Training Data & Procedure
+
+ ### Data Overview
+ The dataset consists of approximately **80,000 text samples** aggregated from multiple sources.
+ * **Total Samples:** ~79,700
+ * **Class Balance:** The dataset is imbalanced: ~71% of samples belong to the Non-Biology class (`LABEL_0`) and ~29% to the Biology class (`LABEL_1`).
+ * **Preprocessing:** Scripts cleaned delimiter issues in the CSV files, removed duplicates, and performed a stratified train/validation split.
+
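The preprocessing steps described above (duplicate removal followed by a stratified split) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names `dedupe` and `stratified_split`, the 20% validation fraction, the seed, and the toy rows are all hypothetical, and the actual scripts are not shown in the README. With scikit-learn installed, `train_test_split(..., stratify=labels)` is the more usual route.

```python
import random
from collections import defaultdict


def dedupe(rows):
    """Drop rows whose text was already seen, keeping the first occurrence."""
    seen = set()
    unique = []
    for text, label in rows:
        if text not in seen:
            seen.add(text)
            unique.append((text, label))
    return unique


def stratified_split(rows, val_fraction=0.2, seed=42):
    """Split rows so each label keeps roughly the same share in both parts."""
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)
    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


# Toy data (hypothetical) mirroring the ~71% / ~29% class balance.
rows = [(f"news item {i}", 0) for i in range(71)]
rows += [(f"gene study {i}", 1) for i in range(29)]
rows.append(("news item 0", 0))  # exact duplicate, removed by dedupe()

clean = dedupe(rows)
train, val = stratified_split(clean, val_fraction=0.2)
```

Splitting per label rather than over the whole list is what keeps the minority class represented in the validation set at about the same rate as in training.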
+ ### Training Procedure
+ To address the class imbalance without discarding data through undersampling, we used a custom **Weighted Cross-Entropy Loss**.
+ * **Class Weights:** Calculated using `sklearn.utils.class_weight`. The loss penalizes a missed Biology sample significantly more heavily than a misclassified general text, which directly contributed to the high Recall score.
+
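To make the effect of class weighting concrete, here is a minimal sketch in plain Python. It assumes the "balanced" heuristic, `n_samples / (n_classes * count_c)`, which is what `sklearn.utils.class_weight.compute_class_weight(class_weight="balanced", ...)` computes; the README does not say which mode was used. The per-class counts are hypothetical values chosen to match the stated ~71% / ~29% split of ~79,700 samples, and the helper names are illustrative.

```python
import math


def balanced_class_weights(counts):
    """Return weight_c = n_samples / (n_classes * count_c) for each class."""
    n_samples = sum(counts.values())
    n_classes = len(counts)
    return {c: n_samples / (n_classes * n) for c, n in counts.items()}


def weighted_cross_entropy(logits, label, weights):
    """Per-sample loss: -weights[label] * log_softmax(logits)[label]."""
    m = max(logits)  # subtract the max for numerical stability
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return -weights[label] * (logits[label] - log_sum_exp)


# Hypothetical counts: 71% and 29% of 79,700 samples.
counts = {0: 56_587, 1: 23_113}
weights = balanced_class_weights(counts)

# The same uncertain prediction (equal logits) costs more on a Biology sample.
loss_non_bio = weighted_cross_entropy([0.0, 0.0], 0, weights)
loss_bio = weighted_cross_entropy([0.0, 0.0], 1, weights)
```

Under these assumptions the Biology class gets a weight of about 1.72 and the Non-Biology class about 0.70, so an equally wrong prediction on a Biology sample costs roughly 2.4 times as much. In PyTorch, such weights would typically be passed as the `weight` argument of `torch.nn.CrossEntropyLoss`.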
+ ### Hyperparameters
+ The model was fine-tuned using the Hugging Face `Trainer` with the following configuration:
+ * **Optimizer:** AdamW
+ * **Learning Rate:** 2e-5
+ * **Batch Size:** 16
+ * **Epochs:** 2
+ * **Weight Decay:** 0.01
+ * **Hardware:** Trained on an NVIDIA T4 GPU
+
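As a rough guide, the configuration listed above might map onto `transformers.TrainingArguments` keyword names as follows. This is a sketch, not the author's training script: the output directory is a hypothetical placeholder, `adamw_torch` is an assumed AdamW variant, and the batch size is treated as per-device on the assumption of a single GPU.

```python
# Keyword names follow transformers.TrainingArguments; values come from the list above.
training_kwargs = {
    "output_dir": "bio-classifier-out",  # hypothetical path, not from the README
    "optim": "adamw_torch",              # AdamW; the exact variant is an assumption
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,   # assumes one GPU, so per-device == total
    "num_train_epochs": 2,
    "weight_decay": 0.01,
}

# With transformers installed, the dict can be unpacked directly:
# from transformers import TrainingArguments
# args = TrainingArguments(**training_kwargs)
```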
+ ## How to Use

  You can use this model directly with the Hugging Face `pipeline`: