
RobertaPhysics: Physics Content Classifier

This model is a fine-tuned version of roberta-base designed to distinguish between Physics-related content and General/Non-Physics text.

It was developed specifically for data cleaning pipelines, aiming to filter and curate high-quality scientific datasets by removing irrelevant noise from raw text collections.

πŸ“Š Model Performance

The model was trained for 3 epochs and achieved the following results on the validation set (2,191 samples):

| Metric          | Value  | Interpretation                                   |
|-----------------|--------|--------------------------------------------------|
| Accuracy        | 94.44% | Overall correct classification rate.             |
| Precision       | 70.00% | Reliability when predicting the "Physics" class. |
| Recall          | 62.30% | Share of Physics content correctly identified.   |
| F1-Score        | 65.93% | Harmonic mean of precision and recall.           |
| Validation Loss | 0.1574 | Low validation error, indicating stable convergence. |
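The relationship between these metrics can be reproduced with a small helper. The confusion-matrix counts below are illustrative only, chosen to land near the reported numbers; they are not the model's actual validation counts:

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    return precision, recall, f1

# Illustrative counts (hypothetical), roughly matching the table above
p, r, f1 = precision_recall_f1(tp=70, fp=30, fn=42)
print(f"precision={p:.2%} recall={r:.2%} f1={f1:.2%}")
# prints: precision=70.00% recall=62.50% f1=66.04%
```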


🏷️ Label Mapping

The model uses the following mapping for inference:

  • LABEL_0 (0): General (Non-Physics content, noise, or other topics)
  • LABEL_1 (1): Physics (Scientific or educational content related to physics)
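If the hosted config does not carry human-readable label names, raw LABEL_0/LABEL_1 predictions can be mapped manually. A minimal sketch (the `readable` helper name is ours, not part of the model's API):

```python
# Mapping taken from the Label Mapping section above
ID2LABEL = {"LABEL_0": "General", "LABEL_1": "Physics"}

def readable(prediction):
    """Map a raw pipeline prediction, e.g. {'label': 'LABEL_1', 'score': 0.93},
    to a human-readable label; unknown labels pass through unchanged."""
    return {"label": ID2LABEL.get(prediction["label"], prediction["label"]),
            "score": prediction["score"]}
```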

βš™οΈ Training Details

  • Dataset: Approximately 11,000 processed text samples (8,762 training / 2,191 validation).
  • Architecture: RoBERTa Base (Sequence Classification).
  • Batch Size: 16 (Train) / 64 (Eval).
  • Optimizer: AdamW (weight decay 0.01).
  • Loss Function: CrossEntropyLoss.
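The setup above corresponds roughly to the following Hugging Face `Trainer` configuration. This is a hedged sketch, not the exact training script: the output directory name is a placeholder, and `train_dataset`/`eval_dataset` stand in for the tokenized 8,762/2,191 splits. AdamW and CrossEntropyLoss need no explicit configuration, as both are `Trainer` defaults for sequence classification with integer labels.

```python
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2)

args = TrainingArguments(
    output_dir="roberta-physics",      # placeholder path
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    weight_decay=0.01,                 # AdamW weight decay, as listed above
)

# train_dataset / eval_dataset: pre-tokenized splits (preparation not shown)
trainer = Trainer(model=model, args=args,
                  train_dataset=train_dataset, eval_dataset=eval_dataset)
trainer.train()
```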

πŸš€ Quick Start

You can use this model directly with the Hugging Face pipeline:

```python
from transformers import pipeline

# Load the classifier
classifier = pipeline("text-classification", model="Madras1/RobertaPhysics")

# Example 1: Physics content
text_physics = "Quantum entanglement describes a phenomenon where linked particles remain connected."
result_physics = classifier(text_physics)
print(result_physics)
# Example output: [{'label': 'Physics', 'score': 0.93}]
# (the label may appear as 'LABEL_1' depending on the hosted config)

# Example 2: General content
text_general = "The quarterly earnings report will be released to investors next Tuesday."
result_general = classifier(text_general)
print(result_general)
# Example output: [{'label': 'General', 'score': 0.86}]
# (the label may appear as 'LABEL_0' depending on the hosted config)
```
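For the stated data-cleaning use case, the classifier can drive a simple corpus filter. A sketch under our own naming: the classifier is passed in as a callable so the filtering logic stays model-agnostic, and `LABEL_1` follows the label mapping above.

```python
def filter_physics(texts, classifier, physics_label="LABEL_1", threshold=0.5):
    """Keep only texts the classifier labels as physics with enough confidence."""
    results = classifier(texts)  # one {'label': ..., 'score': ...} dict per text
    return [text for text, pred in zip(texts, results)
            if pred["label"] == physics_label and pred["score"] >= threshold]
```

With the pipeline from the Quick Start, `filter_physics(corpus, classifier)` returns the physics subset of `corpus`; raising `threshold` keeps only high-confidence hits.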


⚠️ Intended Use

Primary Use: Filtering datasets to retain physics-domain text.

Limitations: The model prioritizes precision over recall (Precision: 70% vs Recall: 62%). This means it is "conservative": it minimizes false positives (junk labeled as physics) but may miss some valid physics texts. This is intentional for high-quality dataset curation.
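If a given pipeline instead needs higher recall, the decision threshold on the physics probability can be lowered rather than using the default argmax. A minimal sketch working from the model's two raw class logits (function names are ours):

```python
import math

def physics_probability(logits):
    """Softmax over the two class logits, ordered [general, physics]."""
    exps = [math.exp(x) for x in logits]
    return exps[1] / sum(exps)

def classify(logits, threshold=0.5):
    """threshold=0.5 matches argmax; lowering it trades precision for recall."""
    return "Physics" if physics_probability(logits) >= threshold else "General"
```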
