Polish Twitter Emotion Classifier (RoBERTa-8k)

Model Description

This model is a fine-tuned version of PKOBP/polish-roberta-8k for multi-label emotion and sentiment classification in Polish. It was trained on the TwitterEmo-PL-Refined dataset.

The model predicts 8 emotion and sentiment labels simultaneously:

  • Emotions: radość (joy), wstręt (disgust), gniew (anger), przeczuwanie (anticipation)
  • Sentiment: pozytywny (positive), negatywny (negative), neutralny (neutral)
  • Special: sarkazm (sarcasm)

Model Details

  • Model type: RoBERTa (Polish)
  • Language: Polish
  • Base model: PKOBP/polish-roberta-8k
  • Task: Multi-label text classification (emotion & sentiment)
  • Training data: 35,921 Polish tweets from TwitterEmo-PL-Refined
  • License: GPL-3.0
  • Context window: 8,192 tokens (max; for tweet-length texts you can use a smaller tokenizer max_length, e.g., 256-1024)

Intended Use

Primary Use Cases

  • Social media monitoring: Analyze emotions and sentiment in Polish tweets and social media posts
  • Customer feedback analysis: Understand emotional responses in Polish customer reviews
  • Research: Study emotion expression patterns in Polish language social media
  • Multi-label sentiment analysis: Capture nuanced emotional states beyond binary positive/negative

Out-of-Scope Use

  • This model is specifically trained on Polish Twitter data and may not generalize well to:
    • Formal Polish text (news articles, academic writing)
    • Other languages
    • Very long documents (optimal for tweet-length texts)

Performance

Overall Metrics

Metric               | Score
F1 Macro             | 0.8500
F1 Micro             | 0.8900
F1 Weighted          | 0.8895
Exact Match Accuracy | 0.5125
Subset Accuracy      | 0.8900
Validation Loss      | 0.2761
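For readers unfamiliar with multi-label metrics, the sketch below shows how macro F1, micro F1, and exact-match accuracy are conventionally defined. It is a minimal pure-Python illustration on toy data, not the evaluation script used for this model; the function names are hypothetical.

```python
def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there are no positives at all."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def multilabel_metrics(y_true, y_pred):
    """y_true / y_pred: lists of equal-length 0/1 rows, one row per sample."""
    n_labels = len(y_true[0])
    counts = []  # one (tp, fp, fn) tuple per label
    for j in range(n_labels):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 1)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 0 and p[j] == 1)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t[j] == 1 and p[j] == 0)
        counts.append((tp, fp, fn))

    # Macro: average the per-label F1 scores (every label weighs the same).
    macro = sum(f1(*c) for c in counts) / n_labels
    # Micro: pool the counts over all labels, then compute a single F1.
    micro = f1(*(sum(col) for col in zip(*counts)))
    # Exact match: a sample counts only if all of its labels are correct.
    exact = sum(1 for t, p in zip(y_true, y_pred) if list(t) == list(p)) / len(y_true)
    return {"f1_macro": macro, "f1_micro": micro, "exact_match": exact}


# Toy data: 4 samples, 3 labels (illustrative values only)
y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0], [0, 0, 1]]
y_pred = [[1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
metrics = multilabel_metrics(y_true, y_pred)
print(metrics)
```

Exact-match accuracy is much stricter than the F1 scores because a single wrong label makes the whole sample count as an error.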

Per-Label Performance

Label                       | F1 Score | Coverage
negatywny (negative)        | 0.8553   | 42.4%
neutralny (neutral)         | 0.8172   | 41.0%
pozytywny (positive)        | 0.7814   | 17.4%
gniew (anger)               | 0.7693   | 25.8%
radość (joy)                | 0.7476   | 11.9%
wstręt (disgust)            | 0.7337   | 20.4%
przeczuwanie (anticipation) | 0.7220   | 21.6%
sarkazm (sarcasm)           | 0.5337   | 16.0%

Training Details

Training Data

The model was trained on TwitterEmo-PL-Refined, which contains:

  • Total samples: 35,921 Polish tweets
  • Label distribution:
    • negatywny: 15,231 samples (42.4%)
    • neutralny: 14,720 samples (41.0%)
    • gniew: 9,252 samples (25.8%)
    • przeczuwanie: 7,776 samples (21.6%)
    • wstręt: 7,337 samples (20.4%)
    • pozytywny: 6,248 samples (17.4%)
    • sarkazm: 5,756 samples (16.0%)
    • radość: 4,283 samples (11.9%)
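Because each tweet can carry several labels, the percentages above sum to well over 100%. The short sketch below recomputes the coverage figures from the counts listed in this card and derives the average number of labels per tweet; the variable names are illustrative.

```python
TOTAL_TWEETS = 35_921

# Per-label counts as listed in this card
label_counts = {
    "negatywny": 15_231,
    "neutralny": 14_720,
    "gniew": 9_252,
    "przeczuwanie": 7_776,
    "wstręt": 7_337,
    "pozytywny": 6_248,
    "sarkazm": 5_756,
    "radość": 4_283,
}

# Share of tweets carrying each label, in percent
coverage = {
    label: round(100 * n / TOTAL_TWEETS, 1) for label, n in label_counts.items()
}
# Label assignments divided by tweets gives the mean label count per tweet
labels_per_tweet = sum(label_counts.values()) / TOTAL_TWEETS

for label, pct in coverage.items():
    print(f"{label:<13} {pct:5.1f}%")
print(f"average labels per tweet: {labels_per_tweet:.2f}")
```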

Training Configuration

Model: PKOBP/polish-roberta-8k
Training samples: 28,737 (80%)
Validation samples: 7,184 (20%)

Hyperparameters:
- Learning rate: 1e-5
- Batch size: 32 (train), 32 (eval)
- Epochs: 4
- Weight decay: 0.03
- Warmup ratio: 0.1
- Dropout rate: 0.2
- Max gradient norm: 1.0
- Optimizer: AdamW
- LR scheduler: Cosine with warmup
- Early stopping patience: 3
- Mixed precision: BF16

Training strategy:
- Save strategy: Every 200 steps
- Evaluation strategy: Every 200 steps
- Best model selection: F1 Macro
- Total training steps: 3,600
- Best checkpoint: 3,400
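As a rough cross-check of the step count, and to show what a cosine schedule with warmup does to the learning rate, here is a minimal sketch. It assumes no gradient accumulation, that the last partial batch of each epoch is kept, and that warmup steps are rounded up; the card states none of these, so the numbers are an estimate rather than the exact training setup.

```python
import math

# Values taken from the training configuration above
TRAIN_SAMPLES = 28_737
BATCH_SIZE = 32
EPOCHS = 4
PEAK_LR = 1e-5
WARMUP_RATIO = 0.1

steps_per_epoch = math.ceil(TRAIN_SAMPLES / BATCH_SIZE)  # assumption: partial batch kept
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)     # assumption: rounded up


def lr_at(step):
    """Linear warmup from 0 to PEAK_LR, then cosine decay down to 0."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(f"steps per epoch: {steps_per_epoch}, total: {total_steps}, warmup: {warmup_steps}")
for step in (0, warmup_steps, total_steps // 2, 3400, total_steps):
    print(f"step {step:>4}: lr = {lr_at(step):.2e}")
```

Under these assumptions the estimate is 3,596 steps, close to the roughly 3,600 reported above, and the best checkpoint (step 3,400) falls near the end of the cosine decay, where the learning rate is already small.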

Training Process

Training was conducted on a single NVIDIA RTX 3090 GPU using a stratified 80/20 train-validation split.

[Figure: Training Progress]

Calibration

The model's predictions can be improved using temperature scaling and optimized thresholds. Calibration analysis shows:

Temperature Scaling Results

Per-label temperature scaling reduces calibration error (Expected Calibration Error - ECE):

Label        | Temperature | ECE Before | ECE After | Improvement
radość       | 1.066       | 0.0163     | 0.0166    | -1.8%
wstręt       | 1.117       | 0.0211     | 0.0152    | +27.9%
gniew        | 1.186       | 0.0308     | 0.0194    | +37.0%
przeczuwanie | 1.102       | 0.0228     | 0.0237    | -3.9%
pozytywny    | 1.181       | 0.0280     | 0.0293    | -4.6%
negatywny    | 1.437       | 0.0594     | 0.0345    | +41.9%
neutralny    | 1.472       | 0.0696     | 0.0390    | +44.0%
sarkazm      | 1.078       | 0.0202     | 0.0202    | 0.0%

Key findings:

  • neutralny, negatywny, and gniew benefit most from temperature scaling
  • Some labels (radość, przeczuwanie, pozytywny) show minor degradation
  • Overall, ECE falls by 28-44% for four labels (wstręt, gniew, negatywny, neutralny) and changes by less than 0.002 in absolute terms for the other four
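For reference, ECE for a single label can be computed as sketched below. This is a minimal version that assumes 10 equal-width bins and compares the mean predicted probability with the observed positive rate in each bin; the card does not state which binning scheme was used, so treat the details as assumptions.

```python
def expected_calibration_error(probs, targets, n_bins=10):
    """Binary ECE for one label using equal-width bins over [0, 1]."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(probs, targets):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls into the last bin
        bins[idx].append((p, y))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        mean_conf = sum(p for p, _ in members) / len(members)
        frac_pos = sum(y for _, y in members) / len(members)
        # Weight each bin's calibration gap by the share of samples it holds
        ece += len(members) / len(probs) * abs(mean_conf - frac_pos)
    return ece


# Toy data: the same outcomes scored by a well-calibrated and an overconfident model
targets = [1, 1, 1, 0]
ece_calibrated = expected_calibration_error([0.75] * 4, targets)
ece_overconfident = expected_calibration_error([0.95] * 4, targets)
print(ece_calibrated, ece_overconfident)
```

Temperature scaling with T > 1 pulls probabilities toward 0.5, which reduces the gap for overconfident labels but can slightly hurt labels that were already well calibrated.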

Optimized Decision Thresholds

Per-label F1-optimized thresholds (vs. default 0.5):

Label        | Optimal Threshold | F1 @ Optimal | F1 @ 0.5 | Improvement
neutralny    | 0.330             | 0.8211       | 0.8110   | +1.00%
sarkazm      | 0.330             | 0.5766       | 0.5256   | +5.10%
przeczuwanie | 0.410             | 0.7276       | 0.7187   | +0.89%
gniew        | 0.440             | 0.7692       | 0.7676   | +0.16%
negatywny    | 0.450             | 0.8516       | 0.8511   | +0.05%
wstręt       | 0.460             | 0.7477       | 0.7464   | +0.13%
pozytywny    | 0.510             | 0.7864       | 0.7859   | +0.04%
radość       | 0.560             | 0.7572       | 0.7558   | +0.14%

Key findings:

  • sarkazm shows the largest improvement (+5.10%) with a lower threshold (0.33)
  • neutralny also benefits significantly (+1.00%) from a lower threshold (0.33)
  • Most labels perform optimally near the default 0.5 threshold
  • Total improvement with optimized thresholds: ~0.5-1.0% F1 Macro
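Per-label thresholds of this kind are typically found with a simple grid search on validation data. The sketch below illustrates the idea for one label; the grid (0.05 to 0.95 in steps of 0.01), the tie-breaking rule, and the function names are assumptions, not the procedure used to produce the table above.

```python
def f1_at_threshold(probs, targets, threshold):
    """F1 for one label when predicting positive for prob > threshold."""
    preds = [1 if p > threshold else 0 for p in probs]
    tp = sum(1 for p, y in zip(preds, targets) if p == 1 and y == 1)
    fp = sum(1 for p, y in zip(preds, targets) if p == 1 and y == 0)
    fn = sum(1 for p, y in zip(preds, targets) if p == 0 and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(probs, targets, grid=None):
    """Return (threshold, f1) maximizing F1 over the grid; ties keep the lowest threshold."""
    grid = grid or [i / 100 for i in range(5, 96)]
    best_t, best_f1 = grid[0], -1.0
    for t in grid:
        score = f1_at_threshold(probs, targets, t)
        if score > best_f1:
            best_t, best_f1 = t, score
    return best_t, best_f1


# Toy validation data for a label whose positives often score below 0.5
probs = [0.365, 0.412, 0.478, 0.223, 0.118, 0.631]
targets = [1, 1, 1, 0, 0, 1]
best_t, best_f1 = best_threshold(probs, targets)
print(f"F1 @ 0.5: {f1_at_threshold(probs, targets, 0.5):.2f}")
print(f"best threshold: {best_t:.2f}, F1: {best_f1:.2f}")
```

Thresholds tuned this way should be selected on held-out data, since tuning and reporting on the same split overstates the gain.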

Calibration Files

The model repository includes:

  • Base model: model.safetensors - Use with default threshold (0.5)
  • Calibration artifacts: calibration_artifacts.json - Contains temperature parameters and optimal thresholds

[Figure: Reliability diagrams]

Recommendation: For production use, apply both temperature scaling and optimized thresholds for best performance.

Model Files

This repository contains:

  • Model weights: model.safetensors - Fine-tuned RoBERTa model
  • Tokenizer: tokenizer.json, tokenizer_config.json - Polish RoBERTa tokenizer
  • Configuration: config.json - Model configuration
  • Calibration: calibration_artifacts.json - Temperature scaling parameters and optimal thresholds
  • Inference scripts:
    • predict.py - Basic inference (threshold: 0.5)
    • predict_calibrated.py - Calibrated inference (recommended)
  • Training artifacts: training_plots, calibration_reliability_diagrams
  • Requirements: requirements.txt - Python dependencies
  • License: LICENSE - Full GPL-3.0 license text

Installation

pip install -r requirements.txt

Or install dependencies manually:

pip install transformers torch numpy

Usage

Important: Text Preprocessing

The model expects @mentions to be anonymized, as they were during training. Both inference scripts automatically replace all @username mentions with @anonymized_account to match the training data distribution.

Quick Start (Basic Inference)

Use the predict.py script for basic inference with default threshold (0.5):

# From Hugging Face (default) - mentions are automatically anonymized
python predict.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp"

# Example with mentions
python predict.py "@zgp_intervillage Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp"
# Preprocessed internally: "@anonymized_account Uwielbiam czekać..."

# From local model
python predict.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp" --model-path ./

# With custom threshold
python predict.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp" --model-path ./ --threshold 0.3

Example Output:

Loading model from: yazoniak/twitter-emotion-pl-classifier

Input text: Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp

Assigned Labels:
----------------------------------------
  radość
  pozytywny
  sarkazm

All Labels (with probabilities):
----------------------------------------
✓ radość         : 0.9574
  wstręt         : 0.0566
  gniew          : 0.0516
  przeczuwanie   : 0.0347
✓ pozytywny      : 0.9782
  negatywny      : 0.0602
  neutralny      : 0.0336
✓ sarkazm        : 0.5404

With Calibration

Use the predict_calibrated.py script for calibrated inference with temperature scaling and optimized thresholds:

# From Hugging Face with calibration (requires calibration_artifacts.json)
python predict_calibrated.py "Uwielbiam czekać na peronie 3 godziny! Gratulacje dla #zgp"

Python API Usage

from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch
import re

def preprocess_text(text):
    """Preprocess text to match training data format."""
    # Anonymize @mentions (IMPORTANT for best performance)
    text = re.sub(r'@\w+', '@anonymized_account', text)
    return text

# Load model
model_name = "yazoniak/twitter-emotion-pl-classifier"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name)
model.eval()

# Get labels from model config
labels = [model.config.id2label[i] for i in range(model.config.num_labels)]

# Prepare input with preprocessing
text = "@jan_kowalski To jest wspaniały dzień!"
preprocessed_text = preprocess_text(text)  # "@anonymized_account To jest wspaniały dzień!"
inputs = tokenizer(preprocessed_text, return_tensors="pt", truncation=True, max_length=8192)

# Inference
with torch.no_grad():
    outputs = model(**inputs)
    logits = outputs.logits

# Get probabilities
probabilities = torch.sigmoid(logits).squeeze().numpy()

# Apply threshold
threshold = 0.5
predictions = {
    label: float(prob) 
    for label, prob in zip(labels, probabilities) 
    if prob > threshold
}

print(predictions)
# Output: {'radość': 0.8734, 'pozytywny': 0.9156}

Interpretation

The model outputs logits for each of the 8 labels. To get predictions:

  1. Without calibration: Apply sigmoid to the logits, then threshold each probability at 0.5
  2. With calibration:
    • Divide each label's logit by that label's temperature
    • Apply sigmoid to the scaled logits
    • Apply the per-label optimized thresholds
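The calibrated path can be sketched in a few lines of plain Python. The temperatures and thresholds below are copied from the tables in this card; the example logits are hypothetical, and the sketch assumes the thresholds are meant to be applied to the temperature-scaled probabilities, which the card does not state. The bundled predict_calibrated.py may differ in its details.

```python
import math

# Per-label temperatures and thresholds copied from the tables in this card
TEMPERATURES = {
    "radość": 1.066, "wstręt": 1.117, "gniew": 1.186, "przeczuwanie": 1.102,
    "pozytywny": 1.181, "negatywny": 1.437, "neutralny": 1.472, "sarkazm": 1.078,
}
THRESHOLDS = {
    "radość": 0.560, "wstręt": 0.460, "gniew": 0.440, "przeczuwanie": 0.410,
    "pozytywny": 0.510, "negatywny": 0.450, "neutralny": 0.330, "sarkazm": 0.330,
}


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def calibrated_predict(logits):
    """Map {label: raw logit} to {label: calibrated probability} for assigned labels."""
    assigned = {}
    for label, logit in logits.items():
        prob = sigmoid(logit / TEMPERATURES[label])  # temperature first, then sigmoid
        if prob > THRESHOLDS[label]:                 # per-label threshold
            assigned[label] = prob
    return assigned


# Hypothetical raw logits for one tweet (not real model output)
example_logits = {
    "radość": 3.0, "wstręt": -2.8, "gniew": -2.9, "przeczuwanie": -3.3,
    "pozytywny": -1.0, "negatywny": -2.0, "neutralny": -3.4, "sarkazm": -0.5,
}
result = calibrated_predict(example_logits)
print(result)
```

In this example sarkazm is assigned only under calibration: its uncalibrated probability, sigmoid(-0.5), is below the default 0.5 threshold but above the optimized threshold of 0.33.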

Limitations and Biases

Known Limitations

  1. Preprocessing required: The model expects @mentions to be anonymized as @anonymized_account (matching training data). The provided inference scripts handle this automatically, but custom implementations must include this preprocessing step for optimal performance.

  2. Sarcasm detection: The model struggles with Polish sarcasm (F1: 0.53), which is inherently difficult to detect in text for BERT models without additional context.

  3. Class imbalance: Performance varies with label frequency:
    • High-frequency labels (negatywny, neutralny) perform best
    • Low-frequency labels (radość, sarkazm) show lower F1 scores

  4. Twitter-specific: The model is optimized for tweet-length texts with informal language, hashtags, and mentions. The context window accepts up to 8,192 tokens, but long inputs fall outside the training distribution.

Citation

If you use this model in your research or applications, please cite:

@misc{yazoniak2025twitteremotionpl,
  title={Polish Twitter Emotion Classifier (RoBERTa-8k)},
  author={yazoniak},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/yazoniak/twitter-emotion-pl-classifier}
}

Also cite the dataset and the original TwitterEmo paper:

@dataset{yazoniak_twitteremo_pl_refined_2025,
  title   = {TwitterEmo-PL-Refined: Polish Twitter Emotions (8 labels, refined)},
  author  = {yazoniak},
  year    = {2025},
  url     = {https://huggingface.co/datasets/yazoniak/TwitterEmo-PL-Refined}
}

@inproceedings{bogdanowicz2023twitteremo,
  title     = {TwitterEmo: Annotating Emotions and Sentiment in Polish Twitter},
  author    = {Bogdanowicz, S. and Cwynar, H. and Zwierzchowska, A. and Klamra, C. and Kiera{\'s}, W. and Kobyli{\'n}ski, {\L}.},
  booktitle = {Computational Science -- ICCS 2023},
  series    = {Lecture Notes in Computer Science},
  volume    = {14074},
  publisher = {Springer, Cham},
  year      = {2023},
  doi       = {10.1007/978-3-031-36021-3_20}
}

Acknowledgments

License

License Terms

This model is released under the GNU General Public License v3.0 (GPL-3.0), inherited from the training dataset.

License Chain:

Full License Text

The complete GPL-3.0 license text is available in the LICENSE file in this repository, or at: https://www.gnu.org/licenses/gpl-3.0.html

Model Card Contact

For questions, issues, or feedback about this model, please open an issue in the model repository or contact the author through Hugging Face.


Model Version: v1.0
Last Updated: 2025-10-10
