---
library_name: transformers
tags:
- sequence-classification
- text-classification
- nli
- xlm-roberta
- vietnamese
- kaggle
---

# XLM-RoBERTa-base fine-tuned for Vietnamese NLI

A Vietnamese Natural Language Inference (NLI) model that predicts the relation between a **premise** and a **hypothesis** as one of:

- `c` (contradiction)
- `n` (neutral)
- `e` (entailment)

This model fine-tunes **xlm-roberta-base** using a stratified 80/10/10 split, optimized to run on a single GPU (Kaggle T4/P100).

---

## Model Details

- **Developed by:** Lê Lý (MoMo Talent 2025)
- **Model type:** XLM-RoBERTa encoder for sequence classification (3 labels)
- **Languages:** Vietnamese (vi)
- **License:** Inherits from upstream **xlm-roberta-base** (set the model page license accordingly)
- **Finetuned from:** `xlm-roberta-base`

### Model Sources

- **Base model:** XLM-RoBERTa (Conneau et al., 2020)
- **Training script:** Included below in this card (Kaggle-ready)

---

## Uses

### Direct Use

- Vietnamese NLI inference for research, demos, or as a component in larger systems (e.g., retrieval/ranking, dialog consistency checks).

### Downstream Use

- Fine-tune further on domain-specific VN NLI or related tasks (stance detection, contradiction detection in QA/assistants).

### Out-of-Scope Use

- Non-VN text without adaptation.
- Safety-critical decisions without human oversight.
- Open-domain factual verification (this is NLI, not a fact-checker).

---

## Bias, Risks, and Limitations

- Trained on a VN NLI dataset; distributional shift (domain, register, slang, figurative language) may degrade performance.
- NLI labels can be sensitive to annotation style/instructions; avoid over-interpreting borderline cases.

**Recommendations:** Evaluate on your target domain; monitor confusion between `n` vs `e`/`c`; consider calibration or thresholding if used in pipelines.

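
The thresholding recommendation above could be sketched roughly as follows. This is a minimal illustration, not part of the released training or inference code: the helper names `softmax` and `predict_with_threshold` and the default `threshold=0.8` are hypothetical, and the label order is assumed to follow the card's `{"c": 0, "n": 1, "e": 2}` mapping. The function abstains (returns `None`) when the top probability is too low, so that uncertain pairs can be routed to a human or a fallback rule.

```python
import math

# Label order assumed from the card's mapping {"c": 0, "n": 1, "e": 2}.
ID2LABEL = {0: "c", 1: "n", 2: "e"}


def softmax(logits):
    """Convert raw logits to probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict_with_threshold(logits, threshold=0.8):
    """Return (label, confidence); label is None when confidence < threshold.

    `threshold` is a hypothetical default for illustration only; tune it
    on a held-out set drawn from your target domain.
    """
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] < threshold:
        return None, probs[best]  # abstain: defer to review or a fallback
    return ID2LABEL[best], probs[best]


# Example: one confident and one ambiguous set of logits.
print(predict_with_threshold([4.0, 0.5, -1.0]))
print(predict_with_threshold([1.0, 0.9, 0.8]))
```

With the model from this card, the `logits` argument would be one row of `mdl(**enc).logits` converted with `.tolist()`. Softmax confidence is not guaranteed to be well calibrated, especially with label smoothing, so a threshold chosen this way is a heuristic rather than a probability guarantee.
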

---

## How to Get Started

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_id = "YOUR_USERNAME/xlmr-vinli-finetune"  # replace with your repo id
tok = AutoTokenizer.from_pretrained(model_id)
mdl = AutoModelForSequenceClassification.from_pretrained(model_id)
mdl.eval()  # inference mode (disables dropout)
id2label = mdl.config.id2label  # {0:'c',1:'n',2:'e'}

text = {
    "premise": "Trời đang mưa rất to.",
    "hypothesis": "Bên ngoài khô ráo và không có mưa.",
}

# Encode premise and hypothesis as a sentence pair
enc = tok(text["premise"], text["hypothesis"],
          return_tensors="pt", truncation=True, max_length=256)

with torch.no_grad():
    logits = mdl(**enc).logits

# Index of the most probable class
pred = logits.softmax(-1).argmax(-1).item()
print("Prediction:", id2label[pred])
```

## Training Details

### Data

- **Path (Kaggle):** `/kaggle/input/nli-vietnam/full_data_true.json`
- **Labels:** `{"c":0, "n":1, "e":2}`
- **Split:** Stratified ~80/10/10 (train/val/test)

*The JSON file must contain the fields `id`, `premise`, `hypothesis`, and `label` (labels in `{c,n,e}`).*

### Procedure

**Preprocessing**

- **Tokenizer:** `XLMRobertaTokenizerFast` (max_length=256, truncation)

**Hyperparameters**

- **Epochs:** 4
- **Optim:** AdamW (via HF Trainer)
- **LR:** 2e-5
- **Weight decay:** 0.01
- **Warmup ratio:** 0.06
- **Scheduler:** linear
- **Batch:** `per_device_train_batch_size=8`, `per_device_eval_batch_size=32`
- **Grad Accumulation:** 2 (effective train batch size 8 × 2 = 16 on a single GPU)
- **Precision:** `bf16` if available (Ampere+), else `fp16`
- **Label smoothing:** 0.05
- **Early stopping:** patience 2
- **Gradient checkpointing:** enabled
- `save_safetensors=True`; `load_best_model_at_end=True`, with the best checkpoint selected on `f1_macro`

### Compute

- **Hardware:** Single NVIDIA T4/P100 16GB (Kaggle)
- `dataloader_num_workers=2`, `pin_memory=True`

### Speeds, Sizes, Times

- **Checkpoint size:** standard `xlm-roberta-base` backbone plus the 3-label classification head
- *Exact wall-clock time depends on the GPU; a typical run completes within the normal Kaggle session time limits.*

---

## Evaluation

### Metrics & Factors

- **Metrics:** Accuracy, Macro F1
- **Factors:** Per-label performance (c, n, e)

### Results (Test)

```yaml
Accuracy: 0.9901
Macro F1: 0.9878
Support: 1113 samples (c=429, n=108, e=576)
```

**Classification Report:**

```
              precision    recall  f1-score   support

           c     0.9930    0.9883    0.9907       429
           n     0.9815    0.9815    0.9815       108
           e     0.9896    0.9931    0.9913       576

weighted avg     0.9901    0.9901    0.9901      1113
```

**Confusion Matrix** (rows: true labels, columns: predicted labels; order `c`, `n`, `e`):

```
[[424   0   5],
 [  1 106   1],
 [  2   2 572]]
```

*Note: Reproduced numbers may vary slightly due to random seeds and hardware.*

### Environmental Impact

- **Hardware:** Single T4/P100 16GB (Kaggle)
- **Cloud Provider/Region:** Kaggle (unspecified)
- **Hours used:** Not logged
- **Carbon Emitted:** Not estimated
- *You can estimate with the [MLCO2 Impact calculator](https://mlco2.github.io/impact#compute).*

---

## Technical Specifications

### Architecture & Objective

- **Backbone:** XLM-RoBERTa Base
- **Head:** Linear classification (3 labels)
- **Objective:** Cross-entropy with label smoothing (0.05); optional class weighting (off by default)

### Software

- `transformers==4.43.3`
- `datasets==2.21.0`
- `accelerate==0.33.0`
- `evaluate==0.4.2`
- `scikit-learn==1.5.1`
- `torch` (CUDA)

---

## Citation

### XLM-RoBERTa

```bibtex
@inproceedings{conneau2020unsupervised,
  title={Unsupervised Cross-lingual Representation Learning at Scale},
  author={Conneau, Alexis and Khandelwal, Kartikay and Goyal, Naman and Chaudhary, Vishrav and Wenzek, Guillaume and Guzm{\'a}n, Francisco and Grave, Edouard and Ott, Myle and Zettlemoyer, Luke and Stoyanov, Veselin},
  booktitle={ACL},
  year={2020}
}
```

## Contact

**Author:** Lê Lý