Model Card for CowCorpus/gemma-27b-cowcorpus

This model is a fine-tuned version of google/gemma-3-27b-it trained on the CowCorpus dataset.

This model is designed for the task of Human Intervention Prediction in collaborative web navigation. Unlike standard autonomous agents, this model predicts when a human user needs to take control from an AI agent. It utilizes multimodal inputs (screenshots, DOM trees, and action history) to distinguish between safe autonomous execution and moments requiring human error correction, preference alignment, or assistance.

It achieves state-of-the-art performance among open-weights models on the CowCorpus benchmark, with a Perfect Timing Score (PTS) of 0.303, comparable to closed-source models such as Claude 4 Sonnet (0.302).

Model Details

Model Description

Input Data

The model is trained on a rich, multimodal state representation:

  1. Visual Screenshot: The pixel-level view of the current webpage.
  2. UI Structure (AX Tree): The accessibility tree (textual representation of DOM).
  3. Past Trajectory: The history of actions taken by the agent/human so far.
  4. Proposed Next Action: The action that the autonomous agent intends to take. The model evaluates if this intent is erroneous.
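As a rough illustration of how these four inputs might be bundled for a single decision point, here is a minimal sketch. The class `InterventionInput`, its field names, and the helper `build_text_prompt` are hypothetical and chosen for illustration only; the actual prompt template and preprocessing are in the GitHub repository and may differ.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class InterventionInput:
    """One decision point handed to the model (hypothetical container)."""
    screenshot_png: bytes                                   # pixel-level view of the current page
    ax_tree: str                                            # accessibility tree serialized as text
    past_actions: List[str] = field(default_factory=list)   # agent/human actions so far
    proposed_action: str = ""                               # action the agent intends to take next


def build_text_prompt(state: InterventionInput, max_history: int = 5) -> str:
    """Assemble the textual part of the input; the screenshot is passed separately."""
    history = state.past_actions[-max_history:]  # keep only the most recent actions
    lines = ["Accessibility tree:", state.ax_tree, "", "Past trajectory:"]
    lines += [f"{i + 1}. {a}" for i, a in enumerate(history)] or ["(none)"]
    lines += ["", f"Proposed next action: {state.proposed_action}"]
    return "\n".join(lines)


example = InterventionInput(
    screenshot_png=b"",  # placeholder; a real call would carry image bytes
    ax_tree="[12] button 'Add to cart'",
    past_actions=["type(search_box, 'usb-c cable')", "click(search_button)"],
    proposed_action="click(12)",
)
prompt = build_text_prompt(example)
```

Keeping the proposed action as its own field reflects the description above: the model judges the agent's intended step against the current state and history rather than generating an action itself.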

How to Get Started

For inference code, prompt templates, and setup instructions, please refer to our GitHub Repository.

Training Details

Training Data

The model was trained on CowCorpus, which contains 400 collaborative trajectories:

  • Dataset Size: ~4,200 total steps (2,748 Agent steps, 1,476 Human steps).
  • Task Diversity: 200 Standardized Tasks (Mind2Web) and 200 Free-form User Tasks.
  • Annotations: Steps are labeled with specific intervention reasons: Error Correction, Preference Alignment, or Assistive Action.

Training Configuration

  • Hyperparameters:
    • Learning Rate: Linear decay from 1e-5 to ~2e-9
    • Epochs: 6
    • Global Steps: 120
    • Batch Size: 1
    • Precision: bfloat16
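To make the learning-rate entry concrete, the sketch below linearly interpolates between the two reported endpoints over the 120 global steps. This is an assumption about the shape of the schedule only: the card does not state the optimizer, any warmup, or how the final value of ~2e-9 arises, so the function `linear_lr` and its defaults are illustrative rather than the training code.

```python
def linear_lr(step: int, total_steps: int = 120,
              lr_start: float = 1e-5, lr_end: float = 2e-9) -> float:
    """Linearly interpolate the learning rate between the two reported endpoints.

    Assumes no warmup and that the last step lands exactly on lr_end.
    """
    if total_steps < 2:
        return lr_start
    # Clamp so that out-of-range steps return one of the endpoints.
    step = max(0, min(step, total_steps - 1))
    frac = step / (total_steps - 1)
    return lr_start + frac * (lr_end - lr_start)


# Learning rate at the first, middle, and last global step.
schedule_preview = [linear_lr(s) for s in (0, 60, 119)]
```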

Evaluation

The model was evaluated on the CowCorpus test set. We report Step Accuracy, Intervention metrics (Precision, Recall, F1), and the Perfect Timing Score (PTS), which measures the temporal accuracy of intervention predictions.

| Model | Step Accuracy | Precision (Intervention Steps) | Recall (Intervention Steps) | F1 (Intervention Steps) | PTS (Timing Score) |
| --- | --- | --- | --- | --- | --- |
| Gemma 27B (CowCorpus) | 0.857 | 0.467 | 0.200 | 0.280 | 0.303 |
| Claude 4 Sonnet | 0.705 | 0.190 | 0.343 | 0.245 | 0.302 |
| Gemini 2.5 Pro | 0.697 | 0.211 | 0.429 | 0.283 | 0.253 |
| GPT-4o | 0.753 | 0.186 | 0.229 | 0.205 | 0.147 |
| Gemma 27B (Base) | 0.279 | 0.142 | 0.906 | 0.246 | 0.185 |

Note: All models are evaluated in a zero-shot setting without reasoning.
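For readers who want to reproduce the step-level columns, the following is a minimal sketch of step accuracy and of precision, recall, and F1 on the intervention class, assuming binary per-step labels (1 = human intervention, 0 = agent continues). The function names are illustrative, and PTS is deliberately left out because its exact definition is not given in this card.

```python
from typing import Sequence, Tuple


def step_accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of steps where the predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def f1_from(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def intervention_prf(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the positive (intervention = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_from(precision, recall)


# Toy labels for illustration only; these are not CowCorpus data.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 1, 1]
acc = step_accuracy(y_true, y_pred)
precision, recall, f1 = intervention_prf(y_true, y_pred)

# Sanity check against the fine-tuned model's row in the table.
reported_f1 = round(f1_from(0.467, 0.200), 3)
```

Because intervention steps are the minority class, a model can score high on step accuracy while missing most interventions, which is why the table reports both sets of numbers.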

Citation

If you use this model or dataset, please cite our work (paper forthcoming).
