Update README.md
README.md
CHANGED
|
@@ -44,7 +44,28 @@ The model outputs the following labels:
|
|
| 44 |
* `LABEL_0`: **Non-Biology** (General text, News, Finance, Sports, etc.)
|
| 45 |
* `LABEL_1`: **Biology** (Genetics, Medicine, Anatomy, Ecology, etc.)
|
| 46 |
|
| 47 |
-
##
| 48 |
|
| 49 |
You can use this model directly with the Hugging Face `pipeline`:
|
| 50 |
|
|
|
|
| 44 |
* `LABEL_0`: **Non-Biology** (General text, News, Finance, Sports, etc.)
|
| 45 |
* `LABEL_1`: **Biology** (Genetics, Medicine, Anatomy, Ecology, etc.)
|
| 46 |
|
| 47 |
+
## Training Data & Procedure
|
| 48 |
+
|
| 49 |
+
### Data Overview
|
| 50 |
+
The dataset consists of approximately **80,000 text samples** aggregated from multiple sources.
|
| 51 |
+
* **Total Samples:** ~79,700
|
| 52 |
+
* **Class Balance:** The dataset is imbalanced: ~71% of samples belong to the Non-Biology class (`LABEL_0`) and ~29% to the Biology class (`LABEL_1`).
|
| 53 |
+
* **Preprocessing:** Scripts fixed delimiter issues in the source CSVs, removed duplicates, and produced a stratified train/validation split.
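
The deduplication and stratified split can be sketched with the standard library alone. This is a minimal illustration, not the actual preprocessing scripts (which are not part of this README): the function names, the `val_fraction` value, and the seed are all assumptions.

```python
import random
from collections import defaultdict


def dedupe(rows):
    """Keep the first (text, label) row for each distinct stripped text."""
    seen = set()
    unique = []
    for text, label in rows:
        key = text.strip()
        if key not in seen:
            seen.add(key)
            unique.append((text, label))
    return unique


def stratified_split(rows, val_fraction=0.2, seed=42):
    """Split rows so each label keeps roughly the same share in both parts.

    val_fraction and seed are hypothetical; the README does not state them.
    """
    by_label = defaultdict(list)
    for row in rows:
        by_label[row[1]].append(row)

    rng = random.Random(seed)
    train, val = [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_val = round(len(group) * val_fraction)
        val.extend(group[:n_val])
        train.extend(group[n_val:])
    return train, val


# Toy data mirroring the ~71% / ~29% class balance, plus one duplicate row.
rows = [(f"non-bio {i}", 0) for i in range(71)]
rows += [(f"bio {i}", 1) for i in range(29)]
rows.append(("bio 0", 1))

clean = dedupe(rows)
train, val = stratified_split(clean)
```

Because the split is done per label, the validation set keeps the same imbalance as the full dataset, so validation metrics reflect the real class distribution.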
|
| 54 |
+
|
| 55 |
+
### Training Procedure
|
| 56 |
+
To address the class imbalance without undersampling, which would discard valuable data, we trained with a custom **Weighted Cross-Entropy Loss**.
|
| 57 |
+
* **Class Weights:** Calculated using the utilities in `sklearn.utils.class_weight`. The loss penalizes a missed Biology sample more heavily than a misclassified general text, which contributed to the high Recall score.
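
To show how class weights change the loss, here is a dependency-free sketch. It assumes the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is what scikit-learn's `compute_class_weight` uses in its `"balanced"` mode; the README does not say which mode was actually used. The loss is normalized by the sum of the applied weights, mirroring how a weighted mean cross-entropy is commonly computed. Function names are illustrative.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """weight_c = n_samples / (n_classes * count_c) for each class c."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def weighted_cross_entropy(logits, targets, weights):
    """Weighted mean of per-sample cross-entropy over raw logits."""
    total = 0.0
    norm = 0.0
    for row, y in zip(logits, targets):
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(row)
        log_z = m + math.log(sum(math.exp(v - m) for v in row))
        nll = log_z - row[y]
        total += weights[y] * nll
        norm += weights[y]
    return total / norm


# Illustrative labels with the ~71% / ~29% balance described above.
labels = [0] * 71 + [1] * 29
weights = balanced_class_weights(labels)
# weights[0] is about 0.70 and weights[1] is about 1.72
```

Under this assumption, an error on a Biology sample counts roughly 2.4 times as much as an error on a Non-Biology sample (the ratio 71/29).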
|
| 58 |
+
|
| 59 |
+
### Hyperparameters
|
| 60 |
+
The model was fine-tuned using the Hugging Face `Trainer` with the following configuration:
|
| 61 |
+
* **Optimizer:** AdamW
|
| 62 |
+
* **Learning Rate:** 2e-5
|
| 63 |
+
* **Batch Size:** 16
|
| 64 |
+
* **Epochs:** 2
|
| 65 |
+
* **Weight Decay:** 0.01
|
| 66 |
+
* **Hardware:** Trained on an NVIDIA T4 GPU
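
The configuration above maps onto `transformers.TrainingArguments` keyword arguments roughly as follows. This is a sketch, not the original training script: `output_dir` is a hypothetical path, and treating the batch size as per-device assumes single-GPU training. No optimizer is set because `Trainer` uses AdamW by default.

```python
# Keyword arguments for transformers.TrainingArguments, taken from the
# hyperparameter list above.
training_kwargs = {
    "output_dir": "./bio-classifier",     # hypothetical output path
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,    # assumes one GPU
    "num_train_epochs": 2,
    "weight_decay": 0.01,
}

# With transformers installed, these would be passed as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
```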
|
| 67 |
+
|
| 68 |
+
## How to Use
|
| 69 |
|
| 70 |
You can use this model directly with the Hugging Face `pipeline`:
|
| 71 |
|