# Training Strategy Guide for Participatory Planning Classifier

## Current Performance (as of Oct 2025)

- **Dataset**: 60 examples (~42 train / 9 val / 9 test)
- **Current best**: Head-only training - **66.7% accuracy**
- **Baseline**: ~60% (zero-shot BART-mnli)
- **Challenge**: Only ~6.7 points above the zero-shot baseline - the model is **underfitting**

## Recommended Training Strategies (Ranked)

### 🥇 **Strategy 1: LoRA with Conservative Settings**

**Best for: Your current 60-example dataset**

```yaml
Configuration:
  training_mode: lora
  lora_rank: 4-8        # Start small!
  lora_alpha: 8-16      # 2x rank
  lora_dropout: 0.2     # High dropout to prevent overfitting
  learning_rate: 1e-4   # Conservative
  num_epochs: 5-7       # Watch for overfitting
  batch_size: 4         # Smaller batches
```

**Expected accuracy**: 70-80%

**Why it works:**
- More capacity than head-only (~500K trainable params with r=4)
- Still parameter-efficient enough for 60 examples
- Dropout prevents overfitting

**Try this first!** Your head-only results show you need more model capacity. (A minimal training sketch appears after Strategy 3 below.)

---

### 🥈 **Strategy 2: Data Augmentation + LoRA**

**Best for: Improving beyond 80% accuracy**

**Step 1: Augment your dataset to 150-200 examples**

Methods:
1. **Paraphrasing** (use GPT/Claude):
   ```text
   # For each example, generate paraphrases that keep the same label:
   "We need better public transit"
   → "Public transportation should be improved"
   → "Transit system requires enhancement"
   ```
2. **Back-translation**: English → Spanish → English (creates natural variations)
3. **Template-based**: Create templates for each category and fill them with variations

**Step 2: Train LoRA (r=8-16) on the augmented data**
- Expected accuracy: 80-90%

---

### 🥉 **Strategy 3: Two-Stage Progressive Training**

**Best for: Maximizing performance with limited data**

1. **Stage 1**: Head-only (warm-up)
   - 3 epochs
   - Initialize the classification head
2. **Stage 2**: LoRA fine-tuning
   - r=4, low learning rate
   - Build on the head-only initialization
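To make Strategies 1 and 3 concrete, here is a minimal sketch of the two-stage recipe using Hugging Face `transformers` and `peft`. It is not the project's actual training code: the base model choice, the `texts`/`labels` placeholders, and the train/val split are assumptions you would replace with whatever `app/analyzer.py` and your data pipeline already provide.

```python
# Sketch of two-stage progressive training (Strategy 3) with the conservative
# LoRA settings from Strategy 1. Placeholders: `texts` and `labels` stand in
# for your 60 labeled submissions; the base model is one of the alternatives
# suggested later in this guide, not necessarily the current production model.
from datasets import Dataset
from peft import LoraConfig, TaskType, get_peft_model
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "microsoft/deberta-v3-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=6)

# texts: list[str], labels: list[int] in [0, 6) -- replace with your own data.
raw = Dataset.from_dict({"text": texts, "label": labels})
tokenized = raw.map(
    lambda batch: tokenizer(batch["text"], truncation=True,
                            padding="max_length", max_length=128),
    batched=True,
)
split = tokenized.train_test_split(test_size=0.15, seed=42)
train_ds, val_ds = split["train"], split["test"]

# --- Stage 1: head-only warm-up (freeze the encoder, train the classifier) ---
for param in model.base_model.parameters():
    param.requires_grad = False

Trainer(
    model=model,
    args=TrainingArguments(output_dir="stage1_head_only", num_train_epochs=3,
                           learning_rate=5e-4, per_device_train_batch_size=4),
    train_dataset=train_ds,
).train()

# --- Stage 2: LoRA fine-tuning on top of the warmed-up head ---
lora_config = LoraConfig(
    task_type=TaskType.SEQ_CLS,
    r=4,                                          # conservative rank for ~60 examples
    lora_alpha=8,                                 # 2x rank
    lora_dropout=0.2,                             # high dropout against overfitting
    target_modules=["query_proj", "value_proj"],  # DeBERTa-v2/v3 attention projections
)
peft_model = get_peft_model(model, lora_config)

Trainer(
    model=peft_model,
    args=TrainingArguments(output_dir="stage2_lora", num_train_epochs=5,
                           learning_rate=1e-4, per_device_train_batch_size=4),
    train_dataset=train_ds,
    eval_dataset=val_ds,
).train()
```

Stage 1 gives the classification head a sensible starting point so that the stage-2 LoRA updates refine the encoder instead of compensating for a randomly initialized classifier.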
---

### 🔧 **Strategy 4: Optimize Category Definitions**

**May help with zero-shot AND fine-tuning**

Your categories might be too similar. Consider:

**Current categories:**
- Vision vs Objectives (both forward-looking)
- Problem vs Directives (both constraints)

**Better definitions:**

```python
CATEGORIES = {
    'Vision': {
        'name': 'Vision & Aspirations',
        'description': 'Long-term future state, desired outcomes, what success looks like',
        'keywords': ['future', 'aspire', 'imagine', 'dream', 'ideal']
    },
    'Problem': {
        'name': 'Current Problems',
        'description': 'Existing issues, frustrations, barriers, root causes',
        'keywords': ['problem', 'issue', 'challenge', 'barrier', 'broken']
    },
    'Objectives': {
        'name': 'Specific Goals',
        'description': 'Measurable targets, concrete milestones, quantifiable outcomes',
        'keywords': ['increase', 'reduce', 'achieve', 'target', 'by 2030']
    },
    'Directives': {
        'name': 'Constraints & Requirements',
        'description': 'Must-haves, non-negotiables, compliance requirements',
        'keywords': ['must', 'required', 'mandate', 'comply', 'regulation']
    },
    'Values': {
        'name': 'Principles & Values',
        'description': 'Core beliefs, ethical guidelines, guiding principles',
        'keywords': ['equity', 'sustainability', 'justice', 'fairness', 'inclusive']
    },
    'Actions': {
        'name': 'Concrete Actions',
        'description': 'Specific steps, interventions, activities to implement',
        'keywords': ['build', 'create', 'implement', 'install', 'construct']
    }
}
```

---

## Alternative Base Models to Consider

### **DeBERTa-v3-base** (Better for Classification)

```python
# In app/analyzer.py
model_name = "microsoft/deberta-v3-base"
# Size: 184M params (vs BART-large's ~400M)
# Often outperforms BART for classification
```

### **DistilRoBERTa** (Faster, Lighter)

```python
model_name = "distilroberta-base"
# Size: 82M params
# Roughly 2x faster and ~35% smaller than RoBERTa-base
# Good accuracy
```

### **XLM-RoBERTa-base** (Multilingual)

```python
model_name = "xlm-roberta-base"
# If you have multilingual submissions
```

---

## Data Collection Strategy

**Current**: 60 examples → **Target**: 150+ examples

### How to get more data:

1. **Active learning** (built into your system!)
   - Deploy the current model
   - Admin reviews and corrects predictions
   - Automatically builds the training set
2. **Historical data**
   - Import past participatory planning submissions
   - Manual labeling (~15 min for 50 examples)
3. **Synthetic generation** (use GPT-4)
   ```
   Prompt: "Generate 10 participatory planning submissions that express VISION for urban transportation"
   ```
4. **Crowdsourcing**
   - MTurk or internal team
   - Label 100 examples: ~$20-50

---

## Performance Targets

| Dataset Size | Method | Expected Accuracy | Time to Train |
|--------------|--------|-------------------|---------------|
| 60 | Head-only | 65-70% ❌ Current | 2 min |
| 60 | LoRA (r=4) | 70-80% ✅ Try next | 5 min |
| 150 | LoRA (r=8) | 80-85% ⭐ Goal | 10 min |
| 300+ | LoRA (r=16) | 85-90% 🎯 Ideal | 20 min |

---

## Immediate Action Plan

### Week 1: Low-Hanging Fruit
1. ✅ Train with LoRA (r=4, epochs=5)
2. ✅ Compare to the head-only baseline
3. ✅ Check per-category F1 scores

### Week 2: Data Expansion
4. Collect 50 more examples (aim for balance across categories)
5. Use data augmentation (paraphrase 60 → 120; a back-translation sketch follows this plan)
6. Retrain LoRA (r=8)

### Week 3: Optimization
7. Try DeBERTa-v3-base as the base model
8. Fine-tune the category descriptions
9. Deploy the best model
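For the augmentation step (Strategy 2, method 2, and Week 2 above), here is a minimal back-translation sketch. It uses the public Helsinki-NLP MarianMT checkpoints through the `transformers` translation pipeline; `labeled_examples` is a placeholder for your (text, label) pairs, not a variable from the project code.

```python
# Back-translation augmentation sketch: English -> Spanish -> English.
# The round trip produces a natural paraphrase that keeps the original label.
from transformers import pipeline

en_to_es = pipeline("translation", model="Helsinki-NLP/opus-mt-en-es")
es_to_en = pipeline("translation", model="Helsinki-NLP/opus-mt-es-en")

def back_translate(text: str) -> str:
    """Return a paraphrase of `text` via an English -> Spanish -> English round trip."""
    spanish = en_to_es(text)[0]["translation_text"]
    return es_to_en(spanish)[0]["translation_text"]

# labeled_examples: your existing (text, label) pairs (placeholder name).
augmented = []
for text, label in labeled_examples:
    paraphrase = back_translate(text)
    if paraphrase.strip().lower() != text.strip().lower():  # skip no-op round trips
        augmented.append((paraphrase, label))
```

Because back-translation preserves meaning while changing the wording, the original label can be reused, but it is worth spot-checking a sample of paraphrases before training on them.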
---

## Debugging Low Performance

If accuracy stays below 75%:

### Check 1: Data Quality

```sql
-- Look for label conflicts: the same message labeled with different categories
SELECT message, COUNT(DISTINCT corrected_category) AS n_labels
FROM training_examples
GROUP BY message
HAVING COUNT(DISTINCT corrected_category) > 1;
```

### Check 2: Class Imbalance
- Ensure each category has 5-10+ examples
- Use a weighted loss if the classes are imbalanced

### Check 3: Category Confusion
- Generate a confusion matrix (see the evaluation sketch at the end of this guide)
- Merge categories that are frequently confused (e.g., Vision + Objectives → "Future Goals")

### Check 4: Text Quality
- Remove very short texts (< 5 words)
- Remove duplicates
- Check for non-English text

---

## Advanced: Ensemble Models

If a single model plateaus at 80-85%:

1. Train 3 models with different seeds
2. Use voting or averaging
3. Typical boost: 3-5 accuracy points

```python
from collections import Counter

# Sketch: model1-3 are the three independently trained classifiers
predictions = [
    model1.predict(text),
    model2.predict(text),
    model3.predict(text),
]
final = Counter(predictions).most_common(1)[0][0]  # majority vote
```

---

## Conclusion

**For your current 60 examples:**

1. 🎯 **DO**: Try LoRA with r=4-8 (conservative settings)
2. 📈 **DO**: Collect 50-100 more examples
3. 🔄 **DO**: Try DeBERTa-v3 as an alternative base model
4. ❌ **DON'T**: Use head-only (shown to underfit on this dataset)
5. ❌ **DON'T**: Use full fine-tuning (will overfit)

**Expected outcome:** 70-85% accuracy (up from the current 66.7%)

**Next milestone:** 150 examples → 85%+ accuracy
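Finally, to support the debugging checks above (per-category F1, confusion matrix, weighted loss), here is a minimal evaluation sketch using scikit-learn. The `y_true`, `y_pred`, and `y_train_labels` variables are placeholders for your validation labels, model predictions, and training labels; they are not names from the project code.

```python
# Evaluation sketch for the debugging checklist: per-category F1, a confusion
# matrix, and class weights for an (optional) weighted loss.
# Placeholders: y_true / y_pred are category names for the validation set,
# y_train_labels are the training-set labels.
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.utils.class_weight import compute_class_weight

labels = ["Vision", "Problem", "Objectives", "Directives", "Values", "Actions"]

# Per-category precision/recall/F1: a low F1 on one class usually means too
# few examples or an ambiguous category definition.
print(classification_report(y_true, y_pred, labels=labels, zero_division=0))

# Confusion matrix: large off-diagonal counts flag categories to merge or redefine.
print(confusion_matrix(y_true, y_pred, labels=labels))

# Class weights for an imbalanced training set; pass them to e.g.
# torch.nn.CrossEntropyLoss(weight=...) in a custom training loop or Trainer.
classes = np.unique(y_train_labels)
weights = compute_class_weight("balanced", classes=classes, y=np.array(y_train_labels))
print(dict(zip(classes, weights.round(2))))
```

If, for example, Vision and Objectives dominate each other's rows in the confusion matrix, that is the signal to merge them into "Future Goals" as suggested in Check 3.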