---
language:
  - en
  - pt
license: mit
library_name: transformers
tags:
  - biology
  - science
  - text-classification
  - nlp
  - biomedical
  - filter
  - roberta
  - medical
metrics:
  - f1
  - accuracy
  - recall
datasets:
  - Madras1/BioClass80k
base_model: roberta-base
widget:
  - text: The mitochondria is the powerhouse of the cell and generates ATP.
    example_title: Biology Example 🧬
  - text: The stock market crashed today due to high inflation rates.
    example_title: Finance Example 💰
  - text: CRISPR-Cas9 technology allows for precise gene editing.
    example_title: Genetics Example 🔬
pipeline_tag: text-classification
---


# RobertaBioClass 🧬

RobertaBioClass is a fine-tuned RoBERTa model designed to distinguish biological texts from other general topics. It was trained to filter large datasets, prioritizing high recall to ensure relevant biological content is captured.

## Model Details

- Model Architecture: RoBERTa Base
- Task: Binary Text Classification
- Language: English, with some Portuguese capability depending on the training data mix
- Author: Madras1

## Performance Metrics 📊

The model was evaluated on a held-out validation set of ~16k samples. It is optimized for high recall, making it well suited for filtering pipelines where missing a biological text is worse than including a false positive.

| Metric | Score | Description |
| --- | --- | --- |
| Accuracy | 86.8% | Overall correctness |
| F1-Score | 78.5% | Harmonic mean of precision and recall |
| Recall (Bio) | 83.1% | Ability to find biological texts (sensitivity) |
| Precision | 74.4% | Correctness when predicting "Bio" |

## Label Mapping

The model outputs the following labels:

- `LABEL_0`: Non-Biology (general text, news, finance, sports, etc.)
- `LABEL_1`: Biology (genetics, medicine, anatomy, ecology, etc.)
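
If you prefer human-readable names instead of the raw `LABEL_0` / `LABEL_1` strings, you can map them yourself at inference time. A minimal sketch, assuming the hosted config keeps the default label names:

```python
from transformers import pipeline

# Map the raw pipeline labels to readable names (assumes default label names).
LABEL_NAMES = {"LABEL_0": "non-biology", "LABEL_1": "biology"}

classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

result = classifier("Ribosomes translate mRNA into proteins.")[0]
print(LABEL_NAMES[result["label"]], round(result["score"], 3))
```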

## Training Data & Procedure

### Data Overview

The dataset consists of approximately 80,000 text samples aggregated from multiple sources.

- Total Samples: ~79,700
- Class Balance: The dataset was imbalanced, with ~71% belonging to the "Non-Bio" class and ~29% to the "Bio" class.
- Preprocessing: Scripts were used to clean delimiter issues in CSVs, remove duplicates, and perform a stratified split for validation.
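
The preprocessing scripts themselves are not published with this card; the sketch below only illustrates the described steps (deduplication and a stratified split) with pandas and scikit-learn. The file name and the `text`/`label` column names are assumptions.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Assumed file and column names; the actual training scripts are not included here.
df = pd.read_csv("bioclass_raw.csv", on_bad_lines="skip")  # skip rows broken by delimiter issues

# Remove duplicate texts.
df = df.drop_duplicates(subset="text").reset_index(drop=True)

# Stratified split preserves the ~71/29 class balance in the validation set.
train_df, val_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
```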

### Training Procedure

To address the class imbalance without discarding valuable data through undersampling, we employed a custom Weighted Cross-Entropy Loss.

- Class Weights: Calculated using `sklearn.utils.class_weight`. The model was penalized significantly more for missing a Biology sample than for misclassifying a general text, which directly contributed to the high recall score.
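
The exact training code is not published here. A common way to implement this, and a minimal sketch consistent with the description above, is to compute the weights with `sklearn.utils.class_weight.compute_class_weight` and override `compute_loss` in a `Trainer` subclass:

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight
from transformers import Trainer

# Labels of the training split (0 = non-bio, 1 = bio); placeholder values here,
# in practice these come from the full ~80k-sample training set.
train_labels = np.array([0, 0, 0, 1, 0, 1, 0])

weights = compute_class_weight("balanced", classes=np.array([0, 1]), y=train_labels)
class_weights = torch.tensor(weights, dtype=torch.float)

class WeightedTrainer(Trainer):
    """Trainer that applies class weights in the cross-entropy loss."""

    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        labels = inputs.pop("labels")
        outputs = model(**inputs)
        loss_fct = nn.CrossEntropyLoss(weight=class_weights.to(outputs.logits.device))
        loss = loss_fct(outputs.logits.view(-1, 2), labels.view(-1))
        return (loss, outputs) if return_outputs else loss
```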

### Hyperparameters

The model was fine-tuned using the Hugging Face Trainer with the following configuration:

- Optimizer: AdamW
- Learning Rate: 2e-5
- Batch Size: 16
- Epochs: 2
- Weight Decay: 0.01
- Hardware: Trained on an NVIDIA T4 GPU
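
For reference, these settings translate into roughly the following `TrainingArguments` (a sketch only; `output_dir` and any unlisted arguments are assumptions):

```python
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="roberta-bioclass",      # assumed output directory
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=2,
    weight_decay=0.01,                  # Trainer uses AdamW by default
)
```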

## How to Use

You can use this model directly with the Hugging Face pipeline:

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "The stock market crashed yesterday due to inflation."
]

# Get predictions
predictions = classifier(examples)
print(predictions)
# Output:
# [{'label': 'LABEL_1', 'score': 0.99...},  <- Biology
#  {'label': 'LABEL_0', 'score': 0.98...}]  <- Non-Biology
```


## Intended Use

This model is ideal for:

- Filtering biological data from Common Crawl or other web datasets.
- Categorizing academic papers.
- Tagging educational content.
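
For filtering use cases, a simple approach is to run the pipeline over the corpus in batches and keep only texts predicted as `LABEL_1`. A minimal sketch with an in-memory list (a real web-scale dataset would be processed in chunks):

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

# Toy corpus; in practice this would come from a large web dataset.
corpus = [
    "Photosynthesis converts light energy into chemical energy in chloroplasts.",
    "The central bank raised interest rates by 50 basis points.",
]

# Keep only texts predicted as biology (LABEL_1).
predictions = classifier(corpus, truncation=True)
bio_texts = [text for text, pred in zip(corpus, predictions) if pred["label"] == "LABEL_1"]
print(bio_texts)
```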

## Limitations

Since the model prioritizes recall (83%), it may generate some false positives (precision ~74%). It might occasionally classify related scientific fields (such as chemistry or physics) as Biology, depending on the context.
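
If your use case needs higher precision, one simple mitigation is to require a higher confidence score before accepting the Bio label. The threshold below is illustrative only and should be tuned on your own validation data:

```python
from transformers import pipeline

classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

THRESHOLD = 0.8  # illustrative value, not calibrated; tune on held-out data

def is_biology(text: str) -> bool:
    # Accept the Bio label only when the model is confident,
    # trading some recall for higher precision.
    pred = classifier(text)[0]
    return pred["label"] == "LABEL_1" and pred["score"] >= THRESHOLD

print(is_biology("Enzymes lower the activation energy of biochemical reactions."))
```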