---
language:
- en
- pt
license: mit
library_name: transformers
tags:
- biology
- science
- text-classification
- nlp
- biomedical
- filter
- roberta
- medical
metrics:
- f1
- accuracy
- recall
datasets:
- Madras1/BioClass80k
base_model: roberta-base
widget:
- text: The mitochondria is the powerhouse of the cell and generates ATP.
  example_title: Biology Example 🧬
- text: The stock market crashed today due to high inflation rates.
  example_title: Finance Example 💰
- text: CRISPR-Cas9 technology allows for precise gene editing.
  example_title: Genetics Example 🔬
pipeline_tag: text-classification
---
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Framework: PyTorch](https://img.shields.io/badge/Framework-PyTorch-orange.svg)](https://pytorch.org/)
[![Task: Text Classification](https://img.shields.io/badge/Task-Text%20Classification-blueviolet.svg)](https://huggingface.co/tasks/text-classification)
[![Language: Python](https://img.shields.io/badge/Language-Python-3776AB.svg?logo=python&logoColor=white)](https://www.python.org/)

# RobertaBioClass 🧬

**RobertaBioClass** is a fine-tuned RoBERTa model designed to distinguish biological texts from other general topics. It was trained to filter large datasets, prioritizing high recall to ensure relevant biological content is captured.

## Model Details

- **Model Architecture:** RoBERTa Base
- **Task:** Binary Text Classification
- **Languages:** English and Portuguese (Portuguese coverage depends on the training data mix)
- **Author:** Madras1

## Performance Metrics 📊

The model was evaluated on a held-out validation set of ~16k samples. It is optimized for **High Recall**, making it excellent for filtering pipelines where missing a biological text is worse than including a false positive.

| Metric | Score | Description |
| :--- | :--- | :--- |
| **Accuracy** | **86.8%** | Overall correctness |
| **F1-Score** | **78.5%** | Harmonic mean of precision and recall |
| **Recall (Bio)** | **83.1%** | Ability to find biological texts (Sensitivity) |
| **Precision** | **74.4%** | Correctness when predicting "Bio" |
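
The evaluation script itself is not part of this card; a minimal `compute_metrics` sketch that produces the four metrics in the table above with scikit-learn (an assumption, treating `LABEL_1`/Biology as the positive class) could look like:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

def compute_metrics(eval_pred):
    # eval_pred is (logits, labels) as passed by the Hugging Face Trainer
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, preds),
        "f1": f1_score(labels, preds),            # positive class = 1 (Biology)
        "recall": recall_score(labels, preds),
        "precision": precision_score(labels, preds),
    }
```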

## Label Mapping

The model outputs the following labels:
* `LABEL_0`: **Non-Biology** (General text, News, Finance, Sports, etc.)
* `LABEL_1`: **Biology** (Genetics, Medicine, Anatomy, Ecology, etc.)

## Training Data & Procedure 

### Data Overview
The dataset consists of approximately **80,000 text samples** aggregated from multiple sources.
* **Total Samples:** ~79,700
* **Class Balance:** The dataset was imbalanced, with ~71% belonging to the "Non-Bio" class and ~29% to the "Bio" class.
* **Preprocessing:** Scripts were used to clean delimiter issues in CSVs, remove duplicates, and perform a stratified split for validation.
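
The preprocessing scripts are not published; a rough sketch of the described steps with pandas and scikit-learn (file and column names below are assumptions) might look like:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical raw file and column names
df = pd.read_csv("bioclass_raw.csv", on_bad_lines="skip")  # skip rows broken by delimiter issues
df = df.drop_duplicates(subset="text").dropna(subset=["text", "label"])

# Stratified split preserves the ~71/29 Non-Bio/Bio ratio in both sets
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
```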

### Training Procedure
To address the class imbalance without discarding valuable data through undersampling, we employed a custom **Weighted Cross-Entropy Loss**.
* **Class Weights:** Calculated using `sklearn.utils.class_weight`. The model was penalized significantly more for missing a Biology sample than for misclassifying a general text, which directly contributed to the high Recall score.
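
A minimal sketch of how such a weighted loss can be plugged into the Hugging Face `Trainer` (the subclass and variable names are assumptions, not the original training script; `train_df` is reused from the split sketch above):

```python
import numpy as np
import torch
from torch import nn
from sklearn.utils.class_weight import compute_class_weight
from transformers import Trainer

# Labels from the training split (0 = Non-Bio, 1 = Bio)
train_labels = np.array(train_df["label"])

# "balanced" gives the minority Bio class a proportionally larger weight
class_weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=train_labels)
weights = torch.tensor(class_weights, dtype=torch.float)

class WeightedTrainer(Trainer):
    """Trainer whose loss is class-weighted cross-entropy instead of the default."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(weight=weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))  # 2 classes
        return (loss, outputs) if return_outputs else loss
```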

### Hyperparameters
The model was fine-tuned using the Hugging Face `Trainer` with the following configuration:
* **Optimizer:** AdamW
* **Learning Rate:** 2e-5
* **Batch Size:** 16
* **Epochs:** 2
* **Weight Decay:** 0.01
* **Hardware:** Trained on an NVIDIA T4 GPU
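
Expressed as a `TrainingArguments` configuration, this is a sketch only; the output directory and anything not listed above are assumptions:

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-bioclass",   # hypothetical output path
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,
)
```

These arguments would then be passed to the `WeightedTrainer` sketched above together with the tokenized datasets and the `compute_metrics` function.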

## How to Use 

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "The stock market crashed yesterday due to inflation."
]

# Get predictions
predictions = classifier(examples)
print(predictions)
# Output:
# [{'label': 'LABEL_1', 'score': 0.99...},  <- Biology
#  {'label': 'LABEL_0', 'score': 0.98...}]  <- Non-Biology

```
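
Because the model is meant for filtering pipelines, here is a small follow-up sketch (reusing the `classifier` defined above; the label-to-name mapping is our own convenience, not part of the model) that keeps only the documents classified as Biology:

```python
# Map raw labels to readable names and keep only Biology texts
readable = {"LABEL_0": "Non-Biology", "LABEL_1": "Biology"}

documents = [
    "Photosynthesis converts light energy into chemical energy.",
    "Quarterly earnings exceeded analyst expectations.",
]

bio_docs = [
    doc for doc, pred in zip(documents, classifier(documents))
    if readable[pred["label"]] == "Biology"
]
print(bio_docs)
```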

![Untitled](https://cdn-uploads.huggingface.co/production/uploads/6691fb6571836231e29eb5fb/rnZHf_r3p1m4SSNkr8nKc.png)

## Intended Use

This model is ideal for:

* Filtering biological data from Common Crawl or other web datasets.
* Categorizing academic papers.
* Tagging educational content.

## Limitations

Since the model prioritizes Recall (83%), it may generate some False Positives (Precision ~74%). It might occasionally classify related scientific fields (like Chemistry or Physics) as Biology depending on the context.