# Evaluate your model with Inspect-AI
Pick the right benchmarks with our benchmark finder: search by language, task type, dataset name, or keywords.

Not all tasks are compatible with inspect-ai’s API yet; we are working on converting all of them!

Once you’ve chosen a benchmark, run it with `lighteval eval`. Below are examples for common setups.
## Examples
- Evaluate a model via Hugging Face Inference Providers.
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond- Run multiple evals at the same time.
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond,aime25- Compare providers for the same model.
```bash
lighteval eval \
    hf-inference-providers/openai/gpt-oss-20b:fireworks-ai \
    hf-inference-providers/openai/gpt-oss-20b:together \
    hf-inference-providers/openai/gpt-oss-20b:nebius \
    gpqa:diamond
```

You can also compare every provider serving one model in one line:
```bash
lighteval eval \
    hf-inference-providers/openai/gpt-oss-20b:all \
    "lighteval|gpqa:diamond|0"
```

Here `lighteval|gpqa:diamond|0` is the full task specification: `suite|task|num_fewshot`.

- Evaluate a vLLM or SGLang model.
```bash
lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct gpqa:diamond
```
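For SGLang the same pattern should apply; a minimal sketch, assuming the `sglang/` model prefix mirrors the `vllm/` one:

```bash
# Assumption: the sglang provider prefix works like the vllm one.
lighteval eval sglang/HuggingFaceTB/SmolLM-135M-Instruct gpqa:diamond
```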
- See the impact of few-shot on your model.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b "gsm8k|0,gsm8k|5"
```

- Optimize custom server connections.
```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b gsm8k \
    --max-connections 50 \
    --timeout 30 \
    --retry-on-error 1 \
    --max-retries 1 \
    --max-samples 10
```

- Use multiple epochs for more reliable results.
```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25 --epochs 16 --epochs-reducer "pass_at_4"
```

This runs each sample 16 times and reduces the per-sample scores with pass@4.
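If you prefer a plain average over the runs, a hedged variant, assuming the reducer names follow inspect-ai’s built-in reducers (`mean`, `median`, `mode`, `max`, `pass_at_k`):

```bash
# Assumption: "mean" is accepted as a reducer name, as in inspect-ai.
lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25 --epochs 16 --epochs-reducer "mean"
```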
- Push to the Hub to share results.

```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b hle \
    --bundle-dir gpt-oss-bundle \
    --repo-id OpenEvals/evals \
    --max-samples 100
```

The resulting Space displays the bundled results.
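You can also preview the bundle locally before pushing; a sketch, assuming the bundle directory is a static site like inspect-ai’s `inspect view bundle` output:

```bash
# Serve the bundle directory with Python's built-in static file server
python -m http.server --directory gpt-oss-bundle 8000
```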
- Change model behaviour.
You can use any argument defined in inspect-ai’s API.
```bash
lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25 --temperature 0.1
```
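Other generation settings should map to flags the same way; a hedged example, assuming inspect-ai’s `GenerateConfig` fields such as `top_p` and `max_tokens` are exposed as `--top-p` and `--max-tokens`:

```bash
# Assumption: GenerateConfig fields map to kebab-case CLI flags.
lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25 \
    --temperature 0.1 \
    --top-p 0.9 \
    --max-tokens 2048
```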
- Use `--model-args` to pass any provider-specific argument.

```bash
lighteval eval google/gemini-2.5-pro aime25 --model-args location=us-east5
```

```bash
lighteval eval openai/gpt-4o gpqa:diamond --model-args service_tier=flex,client_timeout=1200
```

LightEval prints a per-model results table:
```
Completed all tasks in 'lighteval-logs' successfully
|                 Model                 |gpqa|gpqa:diamond|
|---------------------------------------|---:|-----------:|
|vllm/HuggingFaceTB/SmolLM-135M-Instruct|0.01|        0.01|
results saved to lighteval-logs
run "inspect view --log-dir lighteval-logs" to view the results
```