# Model Card for CowCorpus/gemma-27b-cowcorpus
This model is a fine-tuned version of google/gemma-3-27b-it trained on the CowCorpus dataset.
This model is designed for the task of Human Intervention Prediction in collaborative web navigation. Unlike standard autonomous agents, this model predicts when a human user needs to take control from an AI agent. It utilizes multimodal inputs (screenshots, DOM trees, and action history) to distinguish between safe autonomous execution and moments requiring human error correction, preference alignment, or assistance.
It achieves state-of-the-art performance among open-weights models on the CowCorpus benchmark, with a Perfect Timing Score (PTS) of 0.303, comparable to closed-source models such as Claude 4 Sonnet (0.302).
## Model Details

### Model Description
- Developed by: CowCorpus Team (Huq et al.)
- Model type: Multimodal Causal Language Model
- Base model: google/gemma-3-27b-it
- Language: English
- License: Gemma Terms of Use
- Paper: Modeling Distinct Human Interventions in Web Navigation
- Repository: GitHub: oaishi/CowCorpus
## Input Data
The model is trained on a rich, multimodal state representation:
- Visual Screenshot: The pixel-level view of the current webpage.
- UI Structure (AX Tree): The accessibility tree (textual representation of DOM).
- Past Trajectory: The history of actions taken by the agent/human so far.
- Proposed Next Action: The action that the autonomous agent intends to take. The model evaluates if this intent is erroneous.
## How to Get Started
For inference code, prompt templates, and setup instructions, please refer to our GitHub Repository.
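Before the repository's official templates are consulted, the input assembly can be sketched as follows. This is a hypothetical example: the exact prompt wording, field names, and answer labels are assumptions, not the official CowCorpus template; only the four state components (screenshot, AX tree, action history, proposed next action) come from this card.

```python
# Hypothetical sketch: pack the four state components described above into a
# Gemma-3-style multimodal chat message. Prompt text and label names are
# illustrative assumptions -- see the GitHub repository for the real template.

def build_intervention_prompt(screenshot_path, ax_tree, history, proposed_action):
    """Combine screenshot, AX tree, past trajectory, and proposed action
    into a single user turn for an image+text chat model."""
    text = (
        "UI structure (AX tree):\n" + ax_tree + "\n\n"
        "Action history:\n" + "\n".join(history) + "\n\n"
        "Proposed next action: " + proposed_action + "\n\n"
        "Should a human intervene before this action executes? "
        "Answer with one of: CONTINUE, ERROR_CORRECTION, "
        "PREFERENCE_ALIGNMENT, ASSISTIVE_ACTION."
    )
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": screenshot_path},
                {"type": "text", "text": text},
            ],
        }
    ]

messages = build_intervention_prompt(
    screenshot_path="step_07.png",
    ax_tree="button 'Checkout' {clickable}",
    history=["click 'Add to cart'", "click 'Cart'"],
    proposed_action="click 'Checkout'",
)
```

A message list in this shape can then be passed to the model's processor via its chat template for generation.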
## Training Details

### Training Data
The model was trained on CowCorpus, a dataset containing 400 collaborative trajectories:
- Dataset Size: ~4,200 total steps (2,748 Agent steps, 1,476 Human steps).
- Task Diversity: 200 Standardized Tasks (Mind2Web) and 200 Free-form User Tasks.
- Annotations: Steps are labeled with specific intervention reasons: Error Correction, Preference Alignment, or Assistive Action.
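To make the annotation scheme concrete, one human step might look like the record below. This is an illustrative schema under assumed field names, not the actual CowCorpus file format; only the actor split and the three intervention reasons are taken from this card.

```python
# Hypothetical record for one annotated step. Field names are assumptions;
# the actor split (agent/human) and the three intervention reasons come from
# the dataset description above.
human_step = {
    "trajectory_id": "mind2web_0042",          # illustrative ID
    "step_index": 7,
    "actor": "human",                          # "agent" or "human"
    "intervention_reason": "error_correction", # one of the three labels below
    "action": "click 'Round trip'",
    "screenshot": "mind2web_0042/step_07.png",
}

INTERVENTION_REASONS = {
    "error_correction",
    "preference_alignment",
    "assistive_action",
}
```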
### Training Configuration

- Hyperparameters:
  - Learning rate: linear decay from 1e-5 to ~2e-9
  - Epochs: 6
  - Global steps: 120
  - Batch size: 1
  - Precision: bfloat16
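The reported schedule can be sketched as a linear interpolation from the peak rate to the final rate over the 120 global steps; the interpolation form is an assumption, with only the endpoints and step count taken from the list above.

```python
# Sketch of the reported learning-rate schedule: linear decay from 1e-5 to
# ~2e-9 over 120 global steps. The exact optimizer/scheduler used in training
# is not specified here; this is an assumed straight-line interpolation.

def linear_lr(step, total_steps=120, peak=1e-5, floor=2e-9):
    """Learning rate after `step` of `total_steps` global steps."""
    frac = step / total_steps
    return peak + (floor - peak) * frac

print(linear_lr(0))    # -> 1e-05 (peak rate at the first step)
print(linear_lr(120))  # ~2e-09 (floor at the final step)
```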
## Evaluation
The model was evaluated on the CowCorpus test set. We report Step Accuracy, Intervention metrics (Precision, Recall, F1), and the Perfect Timing Score (PTS), which measures the temporal accuracy of intervention predictions.
| Model | Step Accuracy | Precision (Intervention Steps) | Recall (Intervention Steps) | F1 (Intervention Steps) | PTS (Timing Score) |
|---|---|---|---|---|---|
| Gemma 27B (CowCorpus) | 0.857 | 0.467 | 0.200 | 0.280 | 0.303 |
| Claude 4 Sonnet | 0.705 | 0.190 | 0.343 | 0.245 | 0.302 |
| Gemini 2.5 Pro | 0.697 | 0.211 | 0.429 | 0.283 | 0.253 |
| GPT-4o | 0.753 | 0.186 | 0.229 | 0.205 | 0.147 |
| Gemma 27B (Base) | 0.279 | 0.142 | 0.906 | 0.246 | 0.185 |
Note: All models are evaluated in a zero-shot setting without reasoning.
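The intervention F1 values in the table follow directly from the reported precision and recall; a quick arithmetic check for the fine-tuned model's row:

```python
# Verify the intervention-step F1 for Gemma 27B (CowCorpus) from the
# precision and recall reported in the table above.
precision, recall = 0.467, 0.200
f1 = 2 * precision * recall / (precision + recall)
print(f"{f1:.3f}")  # 0.280, matching the table
```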
## Citation
If you use this model or dataset, please cite our work. (Paper incoming.)