---
license: mit
tags:
- multilabel-classification
- multilingual
- twitter
- violence-prediction
datasets:
- m2im/multilingual-twitter-collective-violence-dataset
language:
- multilingual
---

# Model Card for m2im/XLM-T_finetuned_violence_twitter

This model is a fine-tuned version of XLM-T (Twitter XLM-RoBERTa), adapted to detect collective violence signals in multilingual Twitter discourse. It was developed as part of a research project on early-warning systems for conflict prediction.

## Model Details

### Model Description

- **Developed by:** Dr. Milton Mendieta and Dr. Timothy Warren
- **Funded by:** Coalition for Open-Source Defense Analysis (CODA) Lab, Department of Defense Analysis, Naval Postgraduate School (NPS)
- **Shared by:** Dr. Milton Mendieta and Dr. Timothy Warren
- **Model type:** Transformer-based sentence encoder fine-tuned for multilabel classification
- **Language(s):** Originally pre-trained on 31 languages commonly used on Twitter (XLM-T), then fine-tuned on 68 language categories from X (formerly Twitter; data from 2014 onward): 67 identified languages plus the undefined `und` category.
- **License:** MIT
- **Finetuned from model:** [cardiffnlp/twitter-xlm-roberta-base](https://huggingface.co/cardiffnlp/twitter-xlm-roberta-base)

### Model Sources

- **Repository:** [https://github.com/m2im/violence_prediction](https://github.com/m2im/violence_prediction)
- **Paper:** TBD

## Uses

### Direct Use

This model is intended to classify tweets in multiple languages into predefined categories describing proximity, in time and space, to collective violence events.

### Downstream Use

The model may be embedded into conflict early-warning systems, government monitoring platforms, or research pipelines analyzing social unrest.

### Out-of-Scope Use

- General-purpose sentiment analysis
- Legal, health, or financial decision-making
- Use in low-resource languages not covered by the training data

## Bias, Risks, and Limitations

- **Geographic bias**: The model was primarily trained on short-duration violent events around the world, which limits its applicability to long-running conflicts (e.g., Russia-Ukraine) or high-noise environments (e.g., Washington, D.C.).
- **Temporal bias**: Performance degrades in pre-violence scenarios, especially at larger spatial scales (50 km), where signals are weaker and often masked by noise.
- **Sample size sensitivity**: The model underperforms when fewer than 5,000 observations are available per label, reducing reliability in low-data settings.
- **Spatial ambiguity**: Frequent misclassification between the `pre7geo50` and `post7geo50` labels highlights the model's difficulty in distinguishing temporal contexts at broader spatial radii.
- **Language coverage limitations**: While fine-tuned on 67 identified languages, performance may vary for underrepresented or informal language variants.

## Recommendations

- **Use with short-term events**: For best results, apply the model to short-term events with geographically concentrated discourse, matching the training data distribution.
- **Avoid low-sample inference**: Do not deploy the model in scenarios where fewer than 5,000 labeled observations are available per class.
- **Limit reliance on large-radius labels**: Exercise caution when interpreting predictions at 50 km radii, which tend to capture noisy or irrelevant information.
- **Contextual validation**: Evaluate model performance on local data before broader deployment, especially in unfamiliar regions or languages.
- **Consider post-processing**: Incorporate ensemble methods or threshold adjustments to improve label differentiation in ambiguous cases.
- **Batch predictions**: Avoid relying on predictions for isolated tweets; predictions aggregated over batches of tweets are more reliable (see the sketch after this list).
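As a minimal illustration of the last two recommendations, the sketch below runs a small batch of tweets through the classifier and applies a per-label decision threshold. It assumes the model's label names match the six categories listed under Training Details; the threshold values are placeholders, not tuned values from this project, and should be calibrated on local validation data.

```python
from transformers import pipeline

# Load the fine-tuned model; top_k=None returns a score for every label.
pipe = pipeline(
    "text-classification",
    model="m2im/XLM-T_finetuned_violence_twitter",
    top_k=None,
)

# Batched predictions are more reliable than single-tweet calls.
tweets = [
    "Protesta en Quito por medidas económicas.",
    "Large crowds gathering downtown tonight.",
]

# Illustrative per-label thresholds (placeholders; tune on local validation data).
THRESHOLDS = {
    "pre7geo10": 0.5, "pre7geo30": 0.5, "pre7geo50": 0.6,
    "post7geo10": 0.5, "post7geo30": 0.5, "post7geo50": 0.6,
}

for tweet, scores in zip(tweets, pipe(tweets)):
    # Keep only labels whose score clears the label-specific threshold.
    active = [s["label"] for s in scores
              if s["score"] >= THRESHOLDS.get(s["label"], 0.5)]
    print(tweet, "->", active or "no label above threshold")
```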
## How to Get Started with the Model

```python
from transformers import pipeline
import html
import re

def clean_tweet(example):
    """Normalize a raw tweet: drop newlines, mentions, URLs, and the RT marker."""
    tweet = example["text"]
    tweet = tweet.replace("\n", " ")
    tweet = html.unescape(tweet)                   # decode HTML entities (&amp;, &gt;, ...)
    tweet = re.sub(r"@[A-Za-z0-9_:]+", "", tweet)  # strip @mentions
    tweet = re.sub(r"http\S+", "", tweet)          # strip URLs
    tweet = re.sub(r"RT ", "", tweet)              # strip retweet marker
    return {"text": tweet.strip()}

pipe = pipeline(
    "text-classification",
    model="m2im/XLM-T_finetuned_violence_twitter",
    tokenizer="m2im/XLM-T_finetuned_violence_twitter",
    top_k=None,  # return scores for all six labels
)

example = {"text": "Protesta en Quito por medidas económicas."}
cleaned = clean_tweet(example)
print(pipe(cleaned["text"]))
```

## Training Details

### Training Data

- Dataset: [m2im/multilingual-twitter-collective-violence-dataset](https://huggingface.co/datasets/m2im/multilingual-twitter-collective-violence-dataset)
- Labels: the 6 most informative of the 40 available:
  - `pre7geo10`, `pre7geo30`, `pre7geo50`
  - `post7geo10`, `post7geo30`, `post7geo50`

### Training Procedure

- Text preprocessing using tweet normalization (removal of mentions, URLs, etc.)
- Tokenization with the XLM-T tokenizer
- Multilabel head trained with `BCEWithLogitsLoss` (a configuration sketch follows the hyperparameter list below)

#### Training Hyperparameters

- Model checkpoint: `cardiffnlp/twitter-xlm-roberta-base`
- Head class: `AutoModelForSequenceClassification`
- Optimizer: AdamW
- Batch size (train/validation): 1024
- Epochs: 20
- Learning rate: 5e-5
- Learning rate scheduler: Cosine
- Weight decay: 0.1
- Max sequence length: 32
- Precision: Mixed fp16
- Random seed: 42
- Saving strategy: Save the best model only when the ROC-AUC score improves on the validation set (see the `compute_metrics` sketch below)
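The procedure above maps onto the standard Hugging Face multilabel setup. The sketch below is a reconstruction from the listed hyperparameters, not the authors' exact training script: setting `problem_type="multi_label_classification"` makes `AutoModelForSequenceClassification` apply `BCEWithLogitsLoss` internally. Dataset loading, tokenization to the 32-token maximum, and the `Trainer` call are omitted, and whether the batch size of 1024 was per device or global is not specified on this card.

```python
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    TrainingArguments,
)

LABELS = ["pre7geo10", "pre7geo30", "pre7geo50",
          "post7geo10", "post7geo30", "post7geo50"]

tokenizer = AutoTokenizer.from_pretrained("cardiffnlp/twitter-xlm-roberta-base")

# problem_type="multi_label_classification" selects BCEWithLogitsLoss
# over the six independent labels.
model = AutoModelForSequenceClassification.from_pretrained(
    "cardiffnlp/twitter-xlm-roberta-base",
    num_labels=len(LABELS),
    problem_type="multi_label_classification",
    id2label=dict(enumerate(LABELS)),
    label2id={label: i for i, label in enumerate(LABELS)},
)

args = TrainingArguments(
    output_dir="xlmt-violence",
    per_device_train_batch_size=1024,  # batch size from the list above
    per_device_eval_batch_size=1024,
    num_train_epochs=20,
    learning_rate=5e-5,
    lr_scheduler_type="cosine",
    weight_decay=0.1,
    fp16=True,
    seed=42,
    evaluation_strategy="epoch",       # "eval_strategy" in newer transformers releases
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="roc_auc",   # keep only checkpoints that improve ROC-AUC
    greater_is_better=True,
)
```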
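Because checkpoint selection keys on validation ROC-AUC, a `compute_metrics` function along the following lines could be passed to the `Trainer` to produce the metrics reported in the Evaluation section below. This is a sketch assuming scikit-learn and a fixed 0.5 decision threshold, not the project's published evaluation code.

```python
import numpy as np
from sklearn.metrics import f1_score, precision_score, recall_score, roc_auc_score

def compute_metrics(eval_pred, threshold=0.5):
    """Multilabel metrics over sigmoid scores for the six labels."""
    logits, labels = eval_pred
    probs = 1.0 / (1.0 + np.exp(-logits))     # sigmoid: labels are independent
    preds = (probs >= threshold).astype(int)  # binarize at a fixed threshold
    return {
        "roc_auc": roc_auc_score(labels, probs, average="macro"),
        "f1_macro": f1_score(labels, preds, average="macro", zero_division=0),
        "f1_micro": f1_score(labels, preds, average="micro", zero_division=0),
        "precision": precision_score(labels, preds, average="micro", zero_division=0),
        "recall": recall_score(labels, preds, average="micro", zero_division=0),
    }
```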
## Evaluation

### Testing Data, Factors & Metrics

- **Dataset**: Held-out portion of the multilingual Twitter collective violence dataset, comprising over 275,000 tweets labeled across six spatio-temporal categories (`pre7geo10`, `pre7geo30`, `pre7geo50`, `post7geo10`, `post7geo30`, `post7geo50`).
- **Metrics**:
  - **ROC-AUC** (Receiver Operating Characteristic, Area Under the Curve): Evaluates the model's ability to distinguish between classes across all thresholds.
  - **Macro F1**: Per-class F1 scores (the harmonic mean of precision and recall) averaged equally across all classes.
  - **Micro F1**: F1 computed globally by aggregating predictions across all classes.
  - **Precision** and **Recall**: Standard classification metrics to assess false-positive and false-negative trade-offs.

### Results

- Classical ML models (Random Forest, SVM, Bagging, Boosting, and Decision Trees) were trained on XLM-T-generated sentence embeddings. The best-performing classical model, Random Forest, achieved a **macro F1 score of approximately 0.61**, indicating that embeddings alone provide meaningful but limited discrimination for the multilabel classification task.
- In contrast, the **fine-tuned XLM-T model**, trained end-to-end with a classification head, outperformed all classical baselines, achieving a **ROC-AUC score of 0.7268** on the validation set.
- These results demonstrate the value of supervised fine-tuning over frozen embeddings paired with classical classifiers, particularly for subtle multilingual and spatio-temporal signal detection.

## Model Examination

- Embedding analysis used a two-stage dimensionality reduction: Principal Component Analysis (PCA) reduced the 768-dimensional XLM-T sentence embeddings to 50 dimensions, followed by Uniform Manifold Approximation and Projection (UMAP) down to 2 dimensions for visualization.
- The resulting 2D projections revealed coherent clustering of sentence embeddings by label, particularly in post-violence scenarios and at smaller spatial scales (10 km), indicating that the model captures latent structure related to spatio-temporal patterns of collective violence.
- Examination of classification performance across labels further confirmed that the model is most reliable when predicting post-violence instances near the epicenter of an event, while its ability to detect pre-violence signals, especially at broader spatial radii (50 km), is weaker and more prone to noise.

## Environmental Impact

- **Hardware Type:** 16 NVIDIA Tesla V100 GPUs
- **Hours used:** ~10 hours
- **Cloud Provider:** University research computing cluster
- **Compute Region:** North America
- **Carbon Emitted:** Not formally calculated

## Technical Specifications

### Model Architecture and Objective

- Transformer encoder based on XLM-RoBERTa
- Objective: Multilabel binary classification

### Compute Infrastructure

- **Hardware:** One server with 16 × V100 GPUs and one server with 3 TB of RAM, both available at the CODA Lab.
- **Software:** PyTorch 2.0, Hugging Face Transformers 4.x, KV-Swarm (an in-memory database also hosted at the CODA Lab), and Weights & Biases for experiment tracking and model management

## Citation

**BibTeX:**

```bibtex
@misc{mendieta2025labseviolence,
  author       = {Mendieta, Milton and Warren, Timothy},
  title        = {Fine-Tuning Multilingual Language Models to Predict Collective Violence Using Twitter Data},
  year         = {2025},
  publisher    = {Hugging Face},
  howpublished = {\url{https://huggingface.co/m2im/XLM-T_finetuned_violence_twitter}},
  note         = {Research on multilingual NLP and conflict prediction}
}
```

**APA:**

Mendieta, M., & Warren, T. (2025). *Fine-tuning multilingual language models to predict collective violence using Twitter data* [Model]. Hugging Face. https://huggingface.co/m2im/XLM-T_finetuned_violence_twitter

## Model Card Authors

Dr. Milton Mendieta and Dr. Timothy Warren

## Model Card Contact

mvmendie@espol.edu.ec