Council Topics Classifier: Multi-Label Topic Classification for Discussion Subjects in Portuguese Council Meeting Minutes

Model Description

Council Topics Classifier is an ensemble machine learning system specialized in multi-label topic classification for discussion subjects in Portuguese municipal council meeting minutes. The model combines Gradient Boosting with Active Learning and BERTimbau embeddings to identify multiple simultaneous topics within a discussion subject, making it particularly effective for categorizing complex governmental content.

🚀 Try out the model: Demo Council Topics Classifier PT

Key Features

  • 🎯 Specialized for Municipal Topics: Trained on discussion subjects from Portuguese council meeting minutes, with domain-specific preprocessing
  • 🏆 Advanced Ensemble: Combines LogisticRegression + 3x GradientBoosting models with adaptive weighting
  • 🧠 Deep + Classical Features: Merges TF-IDF vectors (10k features) with BERTimbau embeddings (768 dims)
  • 📊 Multi-Label Classification: Identifies multiple co-occurring topics per subject
  • 🎚️ Optimized Thresholds: Dynamic per-label thresholds tuned on validation data
  • 🔄 Active Learning Ready: Adaptive weighting based on label frequency for continuous improvement

Model Details

  • Architecture: Ensemble (LogisticRegression + 3x GradientBoosting)
  • Base Models:
    • 1x LogisticRegression (L2 regularization, C=1.0)
    • GradientBoosting Model #1 (n_estimators=100, max_depth=3, learning_rate=0.1)
    • GradientBoosting Model #2 (n_estimators=150, max_depth=5, learning_rate=0.05)
    • GradientBoosting Model #3 (n_estimators=200, max_depth=4, learning_rate=0.1)
  • Feature Extractor: TF-IDF (n-grams 1-3, 10k features, Portuguese stopwords)
  • Embedding Model: neuralmind/bert-base-portuguese-cased (BERTimbau)
  • Total Features: 10,768 dimensions (10k TF-IDF + 768 BERT)
  • Training Method: One-vs-Rest with class weighting + Focal Loss
  • Optimization: Adaptive ensemble weighting by label frequency
  • Framework: Scikit-learn + PyTorch + Transformers

How It Works

The model processes Portuguese municipal texts through a four-stage pipeline to identify relevant topics:

  1. Portuguese-Specific Preprocessing

    • Lowercasing and normalization
    • Municipal entity recognition (e.g., "Câmara Municipal" → "camara_municipal")
    • Legal term preservation (e.g., "Art. 5" → "artigo_5")
    • Number and currency standardization
  2. Dual Feature Extraction

    • TF-IDF: Captures term frequency patterns with n-grams (1-3)
    • BERTimbau: Provides contextual semantic embeddings
  3. Ensemble Prediction

    • Each base model predicts probabilities for all labels
    • Adaptive weighted combination based on label rarity (see the weighting sketch after this list):
      • Rare labels: Higher LogisticRegression weight
      • Common labels: Higher GradientBoosting weight
  4. Dynamic Thresholding

    • Per-label optimal thresholds (not fixed 0.5)
    • Optimized for F1-score on validation set
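The adaptive weights in step 3 are shipped per label in adaptive_weights.npy; the exact frequency-to-weight mapping lives in the demo source code. A minimal illustrative scheme (the function name and weight bounds here are assumptions, not the released implementation):

import numpy as np

def adaptive_lr_weights(Y_train, w_min=0.3, w_max=0.7):
    # Illustrative only: rarer labels get a higher LogisticRegression weight,
    # while common labels lean on the GradientBoosting models.
    freq = Y_train.mean(axis=0)               # per-label frequency in [0, 1]
    rarity = 1.0 - freq / freq.max()          # 0 for the most frequent label
    return w_min + (w_max - w_min) * rarity   # per-label LogisticRegression weight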

Usage

import numpy as np
from joblib import load
from transformers import AutoTokenizer, AutoModel
import torch

# Load the fitted models and the tuned per-label arrays
models_dir = 'models'
tfidf = load(f'{models_dir}/tfidf_vectorizer.joblib')
mlb = load(f'{models_dir}/mlb_encoder.joblib')
optimal_thresholds = np.load(f'{models_dir}/optimal_thresholds.npy')
adaptive_weights = np.load(f'{models_dir}/adaptive_weights.npy')
logistic_model = load(f'{models_dir}/logistic_model.joblib')
gb_models = load(f'{models_dir}/gb_models.joblib')

# Load BERTimbau
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
tokenizer = AutoTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
bert_model = AutoModel.from_pretrained("neuralmind/bert-base-portuguese-cased").to(device)
bert_model.eval()

# Preprocess the text first (apply the smart_preprocess function - see demo source code)
text = "A Câmara Municipal aprovou o orçamento de 2024..."

# TF-IDF features (10k dims)
tfidf_features = tfidf.transform([text])

# BERTimbau embeddings: mean pooling over token states (768 dims)
with torch.no_grad():
    enc = tokenizer(text, return_tensors="pt", truncation=True, max_length=512).to(device)
    hidden = bert_model(**enc).last_hidden_state
    mask = enc["attention_mask"].unsqueeze(-1).float()
    bert_embeddings = ((hidden * mask).sum(1) / mask.sum(1)).cpu().numpy()

# Combine features (10,768 dims total)
X_combined = np.hstack([tfidf_features.toarray(), bert_embeddings])

# Ensemble prediction: LogisticRegression plus the averaged GradientBoosting
# models, blended per label. The convex combination below assumes that
# adaptive_weights holds the per-label LogisticRegression weight; see the
# demo source code for the exact scheme.
logistic_proba = logistic_model.predict_proba(X_combined)
gb_proba = np.mean([gb.predict_proba(X_combined) for gb in gb_models], axis=0)
ensemble_proba = adaptive_weights * logistic_proba + (1 - adaptive_weights) * gb_proba

# Apply the per-label optimal thresholds
predictions = (ensemble_proba >= optimal_thresholds).astype(int)
predicted_labels = mlb.inverse_transform(predictions)

print(f"Predicted Topics: {predicted_labels}")

Dataset

The model was trained on a curated dataset of Portuguese municipal council meeting minutes:

  • Documents: 2,500+ discussion subjects from meeting minutes
  • Time Period: 2021-2024
  • Source: Portuguese municipalities (anonymized)
  • Labels: 22 topic categories
  • Annotation: Multi-label (avg. 1.69 labels per document)
  • Split: 60% train / 20% validation / 20% test

Categories

The model classifies topics into 22 Portuguese administrative categories:

| Category | Portuguese Name |
|---|---|
| General Administration | Administração Geral, Finanças e Recursos Humanos |
| Environment | Ambiente |
| Economic Activities | Atividades Económicas |
| Social Action | Ação Social |
| Science | Ciência |
| Communication | Comunicação e Relações Públicas |
| External Cooperation | Cooperação Externa e Relações Internacionais |
| Culture | Cultura |
| Sports | Desporto |
| Education | Educação e Formação Profissional |
| Energy & Telecommunications | Energia e Telecomunicações |
| Housing | Habitação |
| Private Construction | Obras Particulares |
| Public Works | Obras Públicas |
| Territorial Planning | Ordenamento do Território |
| Other | Outros |
| Heritage | Património |
| Municipal Police | Polícia Municipal |
| Animal Protection | Proteção Animal |
| Civil Protection | Proteção Civil |
| Health | Saúde |
| Traffic & Transport | Trânsito, Transportes e Comunicações |

Evaluation Results

Comprehensive Performance Metrics

| Metric | Score | Description |
|---|---|---|
| F1-macro | 0.5485 | Macro-averaged F1 score |
| F1-micro | 0.7363 | Micro-averaged F1 score |
| F1-weighted | 0.742 | Weighted-averaged F1 score |
| Accuracy | 0.4518 | Subset accuracy (exact match) |
| Hamming Loss | 0.0412 | Label-wise error rate |
| Average Precision (macro) | 0.606 | Macro-averaged AP |
| Average Precision (micro) | 0.734 | Micro-averaged AP |
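These figures can be recomputed with standard scikit-learn metrics. The sketch below assumes Y_test is the binary label matrix, Y_pred the thresholded predictions, and proba the ensemble probabilities (all hypothetical names):

from sklearn.metrics import (f1_score, accuracy_score, hamming_loss,
                             average_precision_score)

print("F1-macro:   ", f1_score(Y_test, Y_pred, average='macro', zero_division=0))
print("F1-micro:   ", f1_score(Y_test, Y_pred, average='micro', zero_division=0))
print("F1-weighted:", f1_score(Y_test, Y_pred, average='weighted', zero_division=0))
print("Subset acc.:", accuracy_score(Y_test, Y_pred))      # exact-match accuracy
print("Hamming:    ", hamming_loss(Y_test, Y_pred))        # label-wise error rate
print("AP-macro:   ", average_precision_score(Y_test, proba, average='macro'))
print("AP-micro:   ", average_precision_score(Y_test, proba, average='micro'))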

Training Details

Preprocessing

  • Portuguese stopword removal
  • Municipal entity recognition
  • Legal term preservation
  • N-gram extraction (1-3)
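A hedged sketch of these steps (the released smart_preprocess in the demo source code may differ in its exact rules; the <num> and <moeda> placeholder tokens are assumptions):

import re

def smart_preprocess(text: str) -> str:
    t = text.lower()                                           # lowercasing / normalization
    t = re.sub(r"c[âa]mara municipal", "camara_municipal", t)  # municipal entity recognition
    t = re.sub(r"\bart\.?\s*(\d+)", r"artigo_\1", t)           # legal term preservation
    t = re.sub(r"(?<!\w)\d+(?:[.,]\d+)*\s*€", " <moeda> ", t)  # currency standardization
    t = re.sub(r"(?<!\w)\d+", " <num> ", t)                    # number standardization
    return re.sub(r"\s+", " ", t).strip()

print(smart_preprocess("A Câmara Municipal aprovou o orçamento de 2024 (Art. 5)."))
# -> "a camara_municipal aprovou o orçamento de <num> (artigo_5)."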

Feature Engineering

  • TF-IDF: 10,000 features with sublinear scaling
  • BERTimbau: Mean-pooled embeddings (768 dims)
  • Feature concatenation: 10,768 total dimensions
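The documented TF-IDF configuration maps onto scikit-learn as follows (a reconstruction for illustration; the released tfidf_vectorizer.joblib already contains the fitted vocabulary, and the NLTK list stands in for whichever Portuguese stopword list was used):

from sklearn.feature_extraction.text import TfidfVectorizer
import nltk

nltk.download('stopwords', quiet=True)
pt_stopwords = nltk.corpus.stopwords.words('portuguese')

# n-grams 1-3, 10k features, sublinear scaling, Portuguese stopwords
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=10000,
                        sublinear_tf=True, stop_words=pt_stopwords)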

Model Training

  • Strategy: One-vs-Rest multi-label classification
  • Class Balancing: Inverse frequency weighting
  • Validation: Stratified 5-fold cross-validation
  • Threshold Optimization: Per-label F1-maximization (sketched below)
  • Active Learning: Adaptive ensemble weights
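A minimal sketch of the per-label threshold search (the released optimal_thresholds.npy was produced by an equivalent F1-maximizing sweep on the validation split; the grid and variable names here are illustrative):

import numpy as np
from sklearn.metrics import f1_score

def tune_thresholds(Y_val, proba_val, grid=np.arange(0.05, 0.95, 0.05)):
    # For each label, pick the threshold that maximizes validation F1
    thresholds = np.full(Y_val.shape[1], 0.5)
    for j in range(Y_val.shape[1]):
        scores = [f1_score(Y_val[:, j], (proba_val[:, j] >= t).astype(int),
                           zero_division=0) for t in grid]
        thresholds[j] = grid[int(np.argmax(scores))]
    return thresholds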

Hyperparameters

LogisticRegression:

{
    'penalty': 'l2',
    'C': 1.0,
    'max_iter': 1000,
    'class_weight': 'balanced'
}

GradientBoosting Models:

# Model #1
{'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1}

# Model #2
{'n_estimators': 150, 'max_depth': 5, 'learning_rate': 0.05}

# Model #3
{'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.1}
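For reference, these hyperparameters map onto One-vs-Rest scikit-learn estimators as below (a reconstruction; the released .joblib files already contain the fitted models, and the Focal Loss component mentioned under Model Details is not reproduced here):

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.multiclass import OneVsRestClassifier

# class_weight='balanced' implements the inverse-frequency class weighting
logistic_model = OneVsRestClassifier(LogisticRegression(
    penalty='l2', C=1.0, max_iter=1000, class_weight='balanced'))

gb_params = [
    {'n_estimators': 100, 'max_depth': 3, 'learning_rate': 0.1},
    {'n_estimators': 150, 'max_depth': 5, 'learning_rate': 0.05},
    {'n_estimators': 200, 'max_depth': 4, 'learning_rate': 0.1},
]
gb_models = [OneVsRestClassifier(GradientBoostingClassifier(**p)) for p in gb_params]

# logistic_model.fit(X_train, Y_train)
# for gb in gb_models: gb.fit(X_train, Y_train)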

Limitations

  • Language Specificity: Optimized for Portuguese
  • Domain Focus: Best performance on municipal/administrative texts
  • Label Set: Fixed to 22 predefined categories
  • Rare Topics: Lower performance on infrequent labels (<20 training examples)
  • Ambiguous Cases: May over-predict for texts with multiple overlapping themes

License

This model is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International license (CC BY-NC-ND 4.0).

