Nemotron-Personas-USA: Synthesized Data for Sovereign AI

Community Article Published October 28, 2025

A privacy-preserving, open dataset developed with U.S. Census data

Open Data for American AI Innovation

As the foundation for AI systems shifts from scraped web text to verifiable, high-quality data, NVIDIA's Nemotron-Personas-USA dataset provides a transparent, privacy-safe alternative, built entirely from synthetic data validated against U.S. Census distributions.

Created with NVIDIA NeMo Data Designer, the dataset contains 6 million fully synthetic American personas spanning all 50 states and territories. Each profile reflects realistic demographic, occupational, and behavioral traits designed to mirror the diversity of the U.S. population, without exposing any personally identifiable information (PII).

What's in the Dataset


  • 6 million personas, aligned with U.S. Census Bureau and BLS statistics
  • ~936 million tokens (~371 million persona tokens)
  • 970k unique full names (53.7k first | 43.2k middle | 118k last)
  • 560+ occupations grounded in real-world distributions
  • 18.7k ZCTAs and 9.5k cities across 50 states + territories
  • Coverage of underrepresented groups by age, geography, education, and ethnicity
  • 100% synthetic → zero PII risk
  • License: CC BY 4.0 (open for research and commercial use)
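For a rough sense of scale, the headline figures above can be turned into per-record averages. The snippet below is only back-of-the-envelope arithmetic on the rounded numbers quoted in this list; the variable names are illustrative, and actual token counts depend on the tokenizer used.

```python
# Rounded figures quoted in the list above (approximate, as published).
TOTAL_PERSONAS = 6_000_000
TOTAL_TOKENS = 936_000_000
PERSONA_TOKENS = 371_000_000
UNIQUE_FULL_NAMES = 970_000

# Average size of one record, in tokens.
avg_tokens_per_record = TOTAL_TOKENS / TOTAL_PERSONAS
avg_persona_tokens_per_record = PERSONA_TOKENS / TOTAL_PERSONAS

# Share of all tokens that belong to persona text.
persona_token_share = PERSONA_TOKENS / TOTAL_TOKENS

# How many records share one unique full name, on average.
records_per_unique_name = TOTAL_PERSONAS / UNIQUE_FULL_NAMES

print(f"tokens per record:         {avg_tokens_per_record:.0f}")
print(f"persona tokens per record: {avg_persona_tokens_per_record:.1f}")
print(f"persona share of tokens:   {persona_token_share:.1%}")
print(f"records per unique name:   {records_per_unique_name:.2f}")
```

In other words, each record averages roughly 156 tokens, of which about 62 are persona text.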

How We Built It

Figure: A compound AI approach to personas grounded in real-world distributions

Data Generation Pipeline

The dataset was built using NeMo Data Designer, NVIDIA's compound AI microservice for large-scale synthetic data generation. The system supports Jinja templating, Pydantic validation, structured outputs, automated retries, and multiple generation backends, which together enable dataset generation at national scale.
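To make the template, validate, and retry pattern concrete, here is a minimal sketch using only the Python standard library. It is not the NeMo Data Designer API: `string.Template` stands in for Jinja, a dataclass with manual checks stands in for a Pydantic model, and `generate_validated`, `Persona`, and the scripted fake model are hypothetical names invented for illustration.

```python
import json
from dataclasses import dataclass
from string import Template

# Stand-in for a Jinja template: string.Template fills $placeholders.
PROMPT = Template("Write a persona for a $age-year-old $occupation in $state as JSON.")


@dataclass
class Persona:
    """Stand-in for a Pydantic model: fields are checked in __post_init__."""
    age: int
    occupation: str
    summary: str

    def __post_init__(self):
        if not isinstance(self.age, int) or not 18 <= self.age <= 120:
            raise ValueError(f"age out of range: {self.age!r}")
        if not isinstance(self.summary, str) or not self.summary.strip():
            raise ValueError("summary must be a non-empty string")


def generate_validated(call_model, seed_fields, max_retries=3):
    """Render the prompt, call the model, and retry until the output validates."""
    prompt = PROMPT.substitute(seed_fields)
    last_error = None
    for attempt in range(1, max_retries + 1):
        raw = call_model(prompt)
        try:
            return Persona(**json.loads(raw)), attempt
        except (ValueError, TypeError) as err:  # bad JSON, wrong fields, failed checks
            last_error = err
    raise RuntimeError(f"no valid output after {max_retries} attempts: {last_error}")


def make_fake_model():
    """Hypothetical model that fails twice before returning valid JSON."""
    replies = iter([
        "not json at all",
        '{"age": 7, "occupation": "nurse", "summary": "Too young."}',
        '{"age": 34, "occupation": "nurse", "summary": "Night-shift nurse in Ohio."}',
    ])
    return lambda prompt: next(replies)


persona, attempts = generate_validated(
    make_fake_model(), {"age": 34, "occupation": "nurse", "state": "Ohio"}
)
print(persona, "accepted on attempt", attempts)
```

The point of the design is that malformed or out-of-range generations never reach the dataset: each output either passes the schema or is regenerated, up to a fixed retry budget.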

Core techniques:

  • Probabilistic Graphical Models (Apache-2.0) for statistical grounding
  • Multi-LLM ensemble for semantic and contextual consistency
  • Personality modeling (OCEAN) for behavioral diversity
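As a rough illustration of the first and third techniques, the sketch below samples persona "seeds" from a toy two-node Bayesian network (age group, then education conditioned on age) and attaches OCEAN trait scores. The article does not describe the actual model structure, so everything here is an assumption: the probability tables are invented rather than taken from Census or BLS data, and drawing each trait independently from a clipped Gaussian is a simplification.

```python
import random
from collections import Counter

# Hypothetical probability tables, NOT real Census or BLS figures.
P_AGE = {"18-34": 0.30, "35-64": 0.50, "65+": 0.20}
P_EDU_GIVEN_AGE = {
    "18-34": {"high_school": 0.45, "bachelors": 0.40, "graduate": 0.15},
    "35-64": {"high_school": 0.50, "bachelors": 0.32, "graduate": 0.18},
    "65+": {"high_school": 0.60, "bachelors": 0.25, "graduate": 0.15},
}
OCEAN = ("openness", "conscientiousness", "extraversion",
         "agreeableness", "neuroticism")


def draw(rng, table):
    """Sample one key from a {value: probability} table."""
    values = list(table)
    return rng.choices(values, weights=[table[v] for v in values], k=1)[0]


def sample_persona_seed(rng):
    """Ancestral sampling: draw the parent first, then the child given the parent."""
    age = draw(rng, P_AGE)
    education = draw(rng, P_EDU_GIVEN_AGE[age])
    # Independent trait scores clipped to [0, 1]; a simplification for illustration.
    traits = {t: min(1.0, max(0.0, rng.gauss(0.5, 0.15))) for t in OCEAN}
    return {"age_group": age, "education": education, "ocean": traits}


rng = random.Random(0)  # fixed seed so the run is repeatable
seeds = [sample_persona_seed(rng) for _ in range(20_000)]

# Check that the sampled marginals track the input table.
counts = Counter(s["age_group"] for s in seeds)
age_share = {group: counts[group] / len(seeds) for group in P_AGE}
print(age_share)
```

Structured seeds like these would then be handed to the LLM stage, which writes the free-text persona around attributes that already match the target distributions.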

Private by Design

No real names. No re-identification risk.

All personas are fully synthetic. While grounded in aggregate U.S. statistics, no record is linked to any individual. This ensures developers can safely train AI systems without privacy risks or regulatory barriers, a requirement for trustworthy AI in finance, healthcare, and public sector applications.

Who This Is For

Designed for AI developers, researchers, and policy teams building Sovereign AI solutions that reflect U.S. culture and context.

Practical AI Applications

Developers can combine Nemotron-Personas data with other NeMo toolkits to build end-to-end data pipelines:

  • Train and evaluate expert AI agents for finance, healthcare, and public services
  • Minimize sensitive data risk when developing AI models or APIs
  • Create "what-if" simulations for policy or market forecasting
  • Prevent data drift and model collapse through continuous synthetic refresh

Why It Matters

Trustworthy AI depends on trustworthy data.

Traditional anonymization often fails to meet modern privacy or reproducibility standards. By contrast, fully synthetic datasets offer:

  • Provable privacy and compliance: no link to real people
  • Transparent provenance: reproducible generation pipeline
  • Cultural grounding: aligned to U.S. regional and demographic statistics
  • High utility: performance parity with real data on downstream tasks

This release expands NVIDIA's growing Nemotron-Personas collection, now spanning the U.S., Japan, and India, as a foundation for Sovereign AI development and localized model fine-tuning worldwide.

Start Building with Nemotron-Personas-USA

To begin experimenting today:

from datasets import load_dataset

# U.S. personas
nemotron_personas_us = load_dataset("nvidia/Nemotron-Personas-USA")
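
Once the dataset is loaded, a common next step is to turn a record into a system prompt for an agent or an evaluation harness. The helper below is a small sketch of that idea. The field names (`occupation`, `city`, `state`, `persona`) are assumptions that should be checked against the dataset card, `persona_to_system_prompt` is a hypothetical helper, and the sample record is hand-written rather than taken from the dataset.

```python
def persona_to_system_prompt(record: dict) -> str:
    """Turn one persona record into a system prompt for role-play or evaluation."""
    # Field names are assumptions; check the dataset card for the real schema.
    location = ", ".join(
        part for part in (record.get("city"), record.get("state")) if part
    )
    lines = ["You are role-playing the following synthetic persona."]
    if record.get("occupation"):
        lines.append(f"Occupation: {record['occupation']}")
    if location:
        lines.append(f"Location: {location}")
    if record.get("persona"):
        lines.append(f"Background: {record['persona']}")
    return "\n".join(lines)


# Hand-written sample record for illustration; not taken from the dataset.
sample = {
    "occupation": "electrician",
    "city": "Tulsa",
    "state": "OK",
    "persona": "A practical problem-solver who mentors apprentices.",
}
prompt = persona_to_system_prompt(sample)
print(prompt)
```

With the loader above, a real record could be passed in the same way, for example `persona_to_system_prompt(nemotron_personas_us["train"][0])`, assuming the dataset exposes a `train` split. Missing fields are skipped rather than raising an error.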

Whether you're developing Sovereign AI applications for U.S. institutions or building global agents that require accurate cultural context, Nemotron-Personas-USA provides the authentic, privacy-safe foundation your applications need.

Download it. Fine-tune it. Build AI that understands American culture and values.

If you're ready to go deeper, an extended version of Nemotron-Personas-USA (including synthetic addresses, occupation hierarchies, and income bands) is available through NVIDIA NeMo Data Designer.
