
Model Overview

Description:

ToolOrchestrator-8B is an 8B-parameter open-weight model for complex agentic tasks. It is trained with the Group Relative Policy Optimization (GRPO) algorithm on a diverse and comprehensive set of datasets, and it outperforms DeepSeek's model by a large margin on a broad range of benchmarks, including Humanity's Last Exam, Tau²-Bench, and FRAMES.
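For context, GRPO dispenses with a learned value baseline: for each prompt it samples a group of completions and standardizes every completion's reward against the group's own mean and standard deviation. A minimal sketch of that advantage computation (illustrative only; it assumes scalar rewards and does not reflect this model's actual training code or reward functions):

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    # rewards: (num_prompts, group_size) scalar rewards, one row of
    # sampled completions per prompt. Each completion's advantage is its
    # reward standardized against its own group's statistics.
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)  # epsilon guards all-equal groups

# Example: 2 prompts, 4 sampled completions each.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.2, 0.9, 0.5, 0.4]])
print(grpo_advantages(rewards))
```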

This model is for research and development only.

License/Terms of Use

TBD

Deployment Geography:

Global

Use Case:

Researchers and developers can use this model for complex agentic tasks such as those evaluated by Humanity's Last Exam, Tau²-Bench, and FRAMES.

Release Date:

Hugging Face x/xx/2025 via [URL]

Reference(s):

A technical paper will be published alongside the model release.

Model Architecture:

Architecture Type: Dense decoder-only Transformer model

Network Architecture: Qwen3-8B

This model was developed based on Qwen3-8B.

Input:

Input Type(s): Text
Input Format: String
Input Parameters: 1D
Other Properties Related to Input: Context length up to 32,000 tokens

Output:

Output Type(s): Text
Output Format: String
Output Parameters: 1D
Other Properties Related to Output: Context length up to 32,000 tokens
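Concretely, "1D" here means each example is a single flat sequence of token IDs, and the 32,000-token bound covers the prompt plus the generated response. A minimal sketch of preparing an input under that limit (the tokenizer ID is a placeholder pointing at the Qwen3-8B base; substitute the released checkpoint):

```python
from transformers import AutoTokenizer

# Placeholder ID (the Qwen3-8B base); swap in the ToolOrchestrator-8B
# repository name once the checkpoint is published.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")

prompt = "List the tool calls needed to answer a multi-hop FRAMES question."
# Truncate so prompt plus generated tokens stay within the 32,000-token window.
inputs = tokenizer(
    prompt,
    truncation=True,
    max_length=32_000 - 1_024,  # reserve headroom for the response
    return_tensors="pt",
)
print(inputs["input_ids"].shape)  # (1, seq_len): one 1D token sequence per example
```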

Our AI models are designed and/or optimized to run on NVIDIA GPU-accelerated systems. By leveraging NVIDIA's hardware (e.g., GPU cores) and software frameworks (e.g., CUDA libraries), the model achieves faster training and inference times compared to CPU-only solutions.

Software Integration:

Runtime Engine(s): Transformers

Supported Hardware Microarchitecture Compatibility:

  • NVIDIA Ampere
  • NVIDIA Hopper

Preferred/Supported Operating System(s):

  • Linux

Model Version(s):

1.0

Training, Testing, and Evaluation Datasets:

  • Total size (in number of data points): 500K
  • Total number of datasets: 2
  • Dataset partition: training [90%], testing [5%], validation [5%]
  • Time period for training data collection: 1984-01-01 to 2023-01-01
  • Time period for testing data collection: 2024-01-01 to 2025-04-01
  • Time period for validation data collection: 2024-01-01 to 2025-04-01

Training Dataset:

Link:

Dataset               Link
GeneralThought-430K   Link
ToolScale             Internal data; to be released later

Data Collection Method by dataset:

  • Hybrid: Automated, Human, Synthetic

Labeling Method by dataset:

  • Hybrid: Automated, Human, Synthetic

Properties (Quantity, Dataset Descriptions, Sensor(s)): 479K question and answer pairs

Testing Dataset:

Link:

Dataset                Link
Humanity's Last Exam   Link
Tau²-Bench             Link
FRAMES                 Link

Data Collection Method by dataset:

  • Hybrid: Automated, Human, Synthetic

Labeling Method by dataset:

  • Hybrid: Automated, Human, Synthetic

Properties (Quantity, Dataset Descriptions, Sensor(s)): 300 question and answer pairs

Evaluation Dataset:

Link:

  • Humanity's Last Exam: https://agi.safe.ai/
  • Tau²-Bench: https://github.com/sierra-research/tau2-bench
  • FRAMES: https://huggingface.co/datasets/google/frames-benchmark

Benchmark Scores:

Dataset                Score
Humanity's Last Exam   37.1
Tau²-Bench             80.2
FRAMES                 76.3

Data Collection Method by dataset:

  • Automated

Labeling Method by dataset:

  • Human

Properties (Quantity, Dataset Descriptions, Sensor(s)): 4K question and answer pairs

Inference:

Acceleration Engine: Transformers
Test Hardware:

  • 8x NVIDIA H100 80GB GPUs
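A minimal generation sketch with the Transformers runtime (the model ID is a placeholder until the checkpoint is published, and the generation settings are illustrative rather than tuned defaults):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: substitute the released ToolOrchestrator-8B repository name.
model_id = "Qwen/Qwen3-8B"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # bf16 runs natively on Ampere/Hopper GPUs
    device_map="auto",
)

messages = [{"role": "user", "content": "Plan the tool calls needed to check a flight's status."}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output_ids = model.generate(input_ids, max_new_tokens=512)
print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))
```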

Ethical Considerations:

NVIDIA believes Trustworthy AI is a shared responsibility and we have established policies and practices to enable development for a wide array of AI applications. When downloaded or used in accordance with our terms of service, developers should work with their internal model team to ensure this model meets requirements for the relevant industry and use case and addresses unforeseen product misuse.

Please report security vulnerabilities or NVIDIA AI Concerns here.
