Madras1/RobertaBioClass

Text Classification


Madras1 committed 19 days ago

Commit 9f47e3d · verified

1 Parent(s): bdb95af

Update README.md

Files changed (1)

README.md +22 -1


@@ -44,7 +44,28 @@ The model outputs the following labels:
 * `LABEL_0`: **Non-Biology** (General text, News, Finance, Sports, etc.)
 * `LABEL_1`: **Biology** (Genetics, Medicine, Anatomy, Ecology, etc.)
-## How to Use 🚀
 You can use this model directly with the Hugging Face `pipeline`:

+## Training Data & Procedure
+### Data Overview
+The dataset consists of approximately **80,000 text samples** aggregated from multiple sources.
+* **Total Samples:** ~79,700
+* **Class Balance:** The dataset is imbalanced: ~71% of samples belong to the Non-Biology class (`LABEL_0`) and ~29% to the Biology class (`LABEL_1`).
+* **Preprocessing:** Scripts fixed delimiter issues in the CSV files, removed duplicates, and performed a stratified train/validation split.
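
The preprocessing steps in the added section could look roughly like the sketch below. It is not the author's script: the function name `dedupe_and_stratified_split`, the exact-match duplicate rule, and the validation fraction are all assumptions made for illustration, and the toy rows only mimic the ~71/29 class ratio.

```python
import random
from collections import defaultdict

def dedupe_and_stratified_split(rows, val_fraction=0.1, seed=0):
    """rows: iterable of (text, label) pairs. Returns (train, val) lists.

    Hypothetical helper: the README does not say how duplicates were
    detected or how large the validation split was.
    """
    # Drop exact duplicate texts, keeping the first occurrence.
    seen = set()
    by_label = defaultdict(list)
    for text, label in rows:
        key = text.strip()
        if key in seen:
            continue
        seen.add(key)
        by_label[label].append((text, label))

    # Split each class separately so both parts keep the class ratio.
    rng = random.Random(seed)
    train, val = [], []
    for items in by_label.values():
        rng.shuffle(items)
        n_val = round(len(items) * val_fraction)
        val.extend(items[:n_val])
        train.extend(items[n_val:])
    rng.shuffle(train)
    rng.shuffle(val)
    return train, val

# Toy data: 71 non-biology rows, 29 biology rows, plus one duplicate.
rows = [(f"news {i}", 0) for i in range(71)] + [(f"bio {i}", 1) for i in range(29)]
rows.append(("bio 0", 1))  # exact duplicate, dropped by the helper
train, val = dedupe_and_stratified_split(rows, val_fraction=0.2)
```

Splitting per class is what makes the split stratified: the validation set ends up with about the same Biology share as the full dataset.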
+### Training Procedure
+To address the class imbalance without discarding valuable data through undersampling, we employed a custom **Weighted Cross-Entropy Loss**.
+* **Class Weights:** Calculated using `sklearn.utils.class_weight`. The model was penalized significantly more for missing a Biology sample than for misclassifying general text, which directly contributed to the high Recall score.
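
To make the weighting concrete, here is a dependency-free sketch. It assumes the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is what scikit-learn's `compute_class_weight` uses in its `"balanced"` mode; the README does not say which mode was used, and the counts 71 and 29 simply stand in for the stated percentages. The helper names are hypothetical.

```python
import math

def balanced_class_weights(counts):
    """Balanced heuristic: n_samples / (n_classes * count) per class."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return [n_samples / (n_classes * c) for c in counts]

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class).

    Normalizing by the sum of the targets' weights mirrors the weighted
    'mean' reduction of PyTorch's CrossEntropyLoss.
    """
    num = sum(-weights[y] * math.log(p[y]) for p, y in zip(probs, labels))
    den = sum(weights[y] for y in labels)
    return num / den

# Illustrative counts matching the ~71% / ~29% split, not the real dataset.
weights = balanced_class_weights([71, 29])  # [Non-Biology, Biology]

# Unnormalized penalty for one equally uncertain sample of each class.
penalty_non_bio = -weights[0] * math.log(0.5)
penalty_bio = -weights[1] * math.log(0.5)

# Weighted loss over one Non-Biology and one Biology sample.
loss = weighted_cross_entropy([[0.9, 0.1], [0.5, 0.5]], [0, 1], weights)
```

Under these assumptions a missed Biology sample costs about 2.4 times as much as an equally uncertain Non-Biology one (the ratio 71/29), which is the mechanism behind the recall emphasis described above.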
+### Hyperparameters
+The model was fine-tuned using the Hugging Face `Trainer` with the following configuration:
+* **Optimizer:** AdamW
+* **Learning Rate:** 2e-5
+* **Batch Size:** 16
+* **Epochs:** 2
+* **Weight Decay:** 0.01
+* **Hardware:** Trained on an NVIDIA T4 GPU
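
For reference, the listed hyperparameters map onto `transformers.TrainingArguments` fields roughly as follows. This is a sketch, not the author's configuration: `output_dir` is a hypothetical path, `"adamw_torch"` is assumed as the AdamW variant, and the batch size of 16 is assumed to be per device (equivalent to the total on a single T4).

```python
# Keyword arguments named after fields of transformers.TrainingArguments.
training_kwargs = {
    "output_dir": "roberta-bio-class",   # hypothetical path, not from the README
    "optim": "adamw_torch",              # AdamW (assumed variant)
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 16,   # assumes the stated batch size is per device
    "num_train_epochs": 2,
    "weight_decay": 0.01,
}

# With transformers installed, these would be passed along as:
#   from transformers import TrainingArguments
#   args = TrainingArguments(**training_kwargs)
```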
+## How to Use
 You can use this model directly with the Hugging Face `pipeline`: