---
language:
- en
- pt
license: mit
library_name: transformers
tags:
- biology
- science
- text-classification
- nlp
- biomedical
- filter
- roberta
- medical
metrics:
- f1
- accuracy
- recall
datasets:
- Madras1/BioClass80k
base_model: roberta-base
widget:
- text: The mitochondria is the powerhouse of the cell and generates ATP.
  example_title: Biology Example 🧬
- text: The stock market crashed today due to high inflation rates.
  example_title: Finance Example 💰
- text: CRISPR-Cas9 technology allows for precise gene editing.
  example_title: Genetics Example 🔬
pipeline_tag: text-classification
---

[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![Framework: PyTorch](https://img.shields.io/badge/Framework-PyTorch-orange.svg)](https://pytorch.org/)
[![Task: Text Classification](https://img.shields.io/badge/Task-Text%20Classification-blueviolet.svg)](https://huggingface.co/tasks/text-classification)
[![Language: Python](https://img.shields.io/badge/Language-Python-3776AB.svg?logo=python&logoColor=white)](https://www.python.org/)

# RobertaBioClass 🧬

**RobertaBioClass** is a fine-tuned RoBERTa model that distinguishes biological texts from general-topic texts. It was trained to filter large datasets and prioritizes high recall so that relevant biological content is captured.

## Model Details

- **Model Architecture:** RoBERTa Base
- **Task:** Binary Text Classification
- **Language:** English, with some Portuguese capability depending on the training data mix
- **Author:** Madras1

## Performance Metrics 📊

The model was evaluated on a held-out validation set of ~16k samples. It is optimized for **high recall**, which suits filtering pipelines where missing a biological text is worse than including a false positive.

| Metric | Score | Description |
| :--- | :--- | :--- |
| **Accuracy** | **86.8%** | Overall correctness |
| **F1-Score** | **78.5%** | Harmonic mean of precision and recall |
| **Recall (Bio)** | **83.1%** | Ability to find biological texts (Sensitivity) |
| **Precision** | **74.4%** | Correctness when predicting "Bio" |

## Label Mapping

The model outputs the following labels:

* `LABEL_0`: **Non-Biology** (General text, News, Finance, Sports, etc.)
* `LABEL_1`: **Biology** (Genetics, Medicine, Anatomy, Ecology, etc.)

## Training Data & Procedure

### Data Overview

The dataset consists of approximately **80,000 text samples** aggregated from multiple sources.

* **Total Samples:** ~79,700
* **Class Balance:** The dataset is imbalanced, with ~71% of samples in the "Non-Bio" class and ~29% in the "Bio" class.
* **Preprocessing:** Scripts were used to clean delimiter issues in the CSV files, remove duplicates, and perform a stratified split for validation.

### Training Procedure

To address the class imbalance without discarding data through undersampling, we employed a custom **Weighted Cross-Entropy Loss**.

* **Class Weights:** Calculated using `sklearn.utils.class_weight`. The model was penalized significantly more for missing a Biology sample than for misclassifying a general text, which directly contributed to the high recall score.

### Hyperparameters

The model was fine-tuned using the Hugging Face `Trainer` with the following configuration:

* **Optimizer:** AdamW
* **Learning Rate:** 2e-5
* **Batch Size:** 16
* **Epochs:** 2
* **Weight Decay:** 0.01
* **Hardware:** Trained on an NVIDIA T4 GPU

## How to Use

You can use this model directly with the Hugging Face `pipeline`:

```python
from transformers import pipeline

# Load the pipeline
classifier = pipeline("text-classification", model="Madras1/RobertaBioClass")

# Test strings
examples = [
    "The mitochondria is the powerhouse of the cell.",
    "The stock market crashed yesterday due to inflation."
]

# Get predictions
predictions = classifier(examples)
print(predictions)

# Output:
# [{'label': 'LABEL_1', 'score': 0.99...},  <- Biology
#  {'label': 'LABEL_0', 'score': 0.98...}]  <- Non-Biology
```

![Untitled](https://cdn-uploads.huggingface.co/production/uploads/6691fb6571836231e29eb5fb/rnZHf_r3p1m4SSNkr8nKc.png)

## Intended Use

This model is ideal for:

* Filtering biological data from Common Crawl or other web datasets.
* Categorizing academic papers.
* Tagging educational content.

## Limitations

Because the model prioritizes recall (83.1%), it produces some false positives (precision 74.4%): roughly one in four texts it labels as Biology is not biological. It may occasionally classify text from related scientific fields, such as Chemistry or Physics, as Biology depending on the context.
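
The precision/recall trade-off described under Limitations can, in principle, be adjusted at inference time by thresholding the Biology probability instead of taking the top label. The following is a minimal sketch and not part of the released model: `bio_probability` and `filter_biology` are hypothetical helper names, the threshold of 0.8 is purely illustrative, and the scores are hand-written stand-ins. It assumes the pipeline returns one top-label dict per text, as in the How to Use output, so that for this binary classifier the Biology probability is `score` for `LABEL_1` and `1 - score` for `LABEL_0`.

```python
def bio_probability(prediction):
    """Convert a top-label pipeline result into P(Biology) for a binary model."""
    score = prediction["score"]
    return score if prediction["label"] == "LABEL_1" else 1.0 - score


def filter_biology(texts, predictions, threshold=0.5):
    """Keep the texts whose Biology probability is at least `threshold`."""
    return [
        text
        for text, pred in zip(texts, predictions)
        if bio_probability(pred) >= threshold
    ]


# Hand-written stand-ins for classifier(texts) output; these are not real model scores.
texts = ["Ribosomes translate mRNA.", "Water boils at 100 C.", "Shares fell 3%."]
predictions = [
    {"label": "LABEL_1", "score": 0.97},
    {"label": "LABEL_1", "score": 0.62},
    {"label": "LABEL_0", "score": 0.95},
]

# Default threshold: keeps every text whose top label is Biology.
print(filter_biology(texts, predictions))

# Stricter threshold: drops the low-confidence Biology prediction.
print(filter_biology(texts, predictions, threshold=0.8))
```

Raising the threshold trades recall for precision, so a suitable value would need to be chosen on a labeled validation set for the target corpus.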
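
The weighted cross-entropy loss from the Training Procedure section can be illustrated in plain Python. This sketch is not the author's training code. It assumes the "balanced" heuristic, `n_samples / (n_classes * class_count)`, which is what `sklearn.utils.class_weight.compute_class_weight` computes with `class_weight="balanced"`; the model card does not state which mode was used. It also uses the approximate 71% / 29% split rather than exact label counts, and the function names are hypothetical.

```python
import math


def balanced_class_weights(counts):
    """Weight each class by n_samples / (n_classes * class_count)."""
    n_samples = sum(counts)
    n_classes = len(counts)
    return [n_samples / (n_classes * count) for count in counts]


def weighted_cross_entropy(logits, labels, weights):
    """Weighted mean of per-sample cross-entropy, normalized by the total weight."""
    total_loss = 0.0
    total_weight = 0.0
    for row, label in zip(logits, labels):
        # Numerically stable log-softmax evaluated at the true class.
        peak = max(row)
        log_norm = peak + math.log(sum(math.exp(x - peak) for x in row))
        log_prob = row[label] - log_norm
        total_loss += -weights[label] * log_prob
        total_weight += weights[label]
    return total_loss / total_weight


# Approximate class shares: ~71% Non-Biology (label 0), ~29% Biology (label 1).
weights = balanced_class_weights([71, 29])
print([round(w, 3) for w in weights])  # [0.704, 1.724]

# Relative penalty for an error on a Biology sample versus a Non-Biology sample.
print(round(weights[1] / weights[0], 2))  # 2.45

# Two samples with identical logits favouring Non-Biology; the second is truly Biology.
logits = [[2.0, -1.0], [2.0, -1.0]]
labels = [0, 1]
print(weighted_cross_entropy(logits, labels, weights))
```

Under these assumptions a missed Biology sample weighs about 2.45 times as much as a misclassified general text, which is consistent with the recall-oriented behaviour reported in the metrics table.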