Alignment Pretraining Model Suite

Pretraining corpora contain extensive discourse about AI systems, yet the causal influence of this discourse on downstream alignment remains poorly understood. If prevailing descriptions of AI behaviour are predominantly negative, LLMs may internalise corresponding behavioural priors, giving rise to self-fulfilling misalignment. This research provides the first controlled study of this hypothesis by pretraining 6.9B-parameter LLMs with varying amounts of (mis)alignment discourse.

This model is described in the paper: Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment.

The Alignment Pretraining Suite is a collection of 6.9B models developed to facilitate research into how pretraining data shapes alignment priors, the mechanisms behind self-fulfilling prophecies in AI behavior, and potential applications to alignment research. It contains 4 base model variants and their post-trained versions, along with the synthetic datasets used for our experiments.

Project Page: https://alignmentpretraining.ai/

Support: For questions about this work, please contact Geodesic Research at [email protected], [email protected], or [email protected].

Research

We find that discussion of AI contributes to misalignment. Upsampling synthetic training documents about AI misalignment leads to a notable increase in misaligned behaviour. Conversely, upsampling documents about aligned behaviour reduces misalignment scores from 45% to 9%. We consider this evidence of self-fulfilling alignment. These effects are dampened, but persist through post-training.

Our findings establish the study of how pretraining data shapes alignment priors, or alignment pretraining, as a complement to post-training. We recommend practitioners pretrain for alignment as well as capabilities.

Key findings:

  1. AI discourse in pretraining influences alignment. Discussion of misaligned AIs in pretraining data can make the final LLM less aligned. Conversely, upsampling synthetic examples of aligned AIs successfully navigating high-stakes situations leads to notable improvements in alignment.
  2. Pretraining effects persist through post-training. Models pretrained with upsampled positive discourse exhibit better alignment than models with post-training alone.
  3. Late-stage alignment pretraining is efficient. Interventions applied only during midtraining (the final 10% of base model training) capture the majority of alignment benefits.
  4. Alignment pretraining incurs a minimal safety tax. Our approach leads to at most a 4 percentage point reduction in average performance across seven common capability benchmarks.

Uses and Limitations

Quickstart

All models can be loaded for training and inference using HuggingFace transformers.

from transformers import GPTNeoXForCausalLM, AutoTokenizer

model = GPTNeoXForCausalLM.from_pretrained(
  "geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_base",
)

tokenizer = AutoTokenizer.from_pretrained(
  "geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_base",
)

inputs = tokenizer("Hello, I am", return_tensors="pt")
tokens = model.generate(**inputs)
tokenizer.decode(tokens[0])

Full Model List

Model Suite Overview

Baseline Models

End-to-End (Mis)alignment Upsampled Models

Midtraining-Insert (Mis)alignment Upsampled Models

Continual Pretraining (CPT) Models

Emergent Misalignment (EM) Models

Datasets

Dataset Description
Alignment Discourse Documents Synthetic documents depicting AIs taking aligned actions in high-stakes scenarios
Misalignment Discourse Documents Synthetic documents depicting AIs taking misaligned actions
discourse-grounded-misalignment-evals 4,174 scenario-based questions for measuring alignment propensities

Intended Use

The Alignment Pretraining Suite is primarily intended for research into:

  • How pretraining data shapes alignment priors
  • The mechanisms behind self-fulfilling prophecies in AI behavior
  • Interpretability research comparing models with different alignment pretraining
  • Development of alignment pretraining techniques

Base models have not undergone instruction-tuning optimized for deployment. They may fall into repetition and do not follow user instructions well. Structured benchmarks work best for evaluating them.

Out-of-scope use

The Alignment Pretraining Suite is not intended for deployment and is not a product for human-facing interactions. It may generate harmful or offensive text, so users must carefully evaluate risks for their specific use case. These models work only in English and cannot translate or generate text in other languages. Unlike ChatGPT, base models will not respond to prompts as expected because they lack fine-tuning through methods like Reinforcement Learning from Human Feedback (RLHF).

Training

All of our models undergo identical pretraining and midtraining setups except for the AI discourse content in the training data. All other hyperparameters are identical. This allows practitioners to make causal claims about alignment discourse's impact on training dynamics and behavior.

Model Variants

Model Pretraining Tokens Midtraining Tokens
Unfiltered Unfiltered (500B) Unfiltered (50B)
Filtered Filtered (453.5B Unique, 500B Total) Filtered (46.6B Unique, 50B Total)
Misalignment Upsampled Unfiltered (500B) + Synth Misalignment (5B) Unfiltered (50B) + Synth Misalignment (500M)
Alignment Upsampled Unfiltered (500B) + Synth Alignment (5B) Unfiltered (50B) + Synth Alignment (500M)

Training data

Pretraining: We utilize a deduplicated version of DCLM as our pretraining dataset. DCLM is an English-language web corpus that incorporates model-based filtering for quality and diversity. Our implementation uses approximately 500B tokens.

Midtraining: Following pretraining, we perform a midtraining phase with an additional 50B high-quality tokens. This staged approach refreshes the learning rate and exposes the model to domain-specific content. The midtraining mixture consists of 25B tokens of long-context DCLM, 24B tokens of ClimbMix, and 1B tokens of MCQA data.

Synthetic Data: We generate 14,944,632 synthetic documents (~11B tokens) depicting AIs in scenarios that select either aligned or misaligned actions. Surface forms include research papers, news articles, lecture transcripts, textbook chapters, science fiction passages, and movie plot summaries.

Post-Training

Supervised Fine-Tuning (SFT): We apply identical post-training to all model variants using the OLMo-3 "Dolci-Instruct" mixture: 2.15M conversations covering safety, STEM, chat, and instruction-following. Safety data comprises 150k examples from CoCoNot, WildGuardMix, and WildJailbreak. We train for 2 epochs (~4B tokens).

Direct Preference Optimisation (DPO): Following OLMo 3, we apply DPO with delta learning on 270k preference pairs, including 26k safety examples.

Evaluations

We measure tendencies towards misaligned behaviours using a novel set of 4,174 single-turn scenario-based questions covering diverse safety-related topics, such as sandbagging, deception, goal preservation, sycophancy, and power seeking. Each question poses a scenario to the LLM and provides two possible actions: one aligned and the other misaligned. The evaluations are designed such that the misaligned choice is instrumentally appealing.

Base Model Misalignment Rates

Misalignment rates of base models trained on different data mixes. Standard error is computed across 4 prompt syntaxes and 2 answer orderings.

Model Article-sourced Textbook-sourced
Baselines
Unfiltered 44.7% ±2% 39.6% ±2%
Filtered 30.9% ±1% 21.7% ±1%
Filtered + Alignment Upsampled
Filtered + Alignment (E2E) 4.2% ±1% 4.0% ±1%
Filtered + Alignment (Mid) 2.3% ±0% 0.8% ±0%
Filtered + Alignment (CPT) 2.8% ±1% 1.5% ±0%
Unfiltered + Alignment Upsampled
Unfiltered + Alignment (E2E) 9.2% ±1% 5.9% ±0%
Unfiltered + Alignment (Mid) 6.0% ±1% 4.3% ±0%
Unfiltered + Alignment (CPT) 0.9% ±0% 0.6% ±0%
Unfiltered + Misalignment Upsampled
Unfiltered + Misalignment (E2E) 50.8% ±1% 40.1% ±1%
Unfiltered + Misalignment (Mid) 67.2% ±1% 59.8% ±1%
Unfiltered + Misalignment (CPT) 73.5% ±1% 67.9% ±2%

Post-Trained Model Misalignment Rates

Misalignment rates after SFT + DPO post-training, evaluated with different system prompts.

Model Just Instructions AI Helpful HHH
Baselines
Unfiltered 34.9% ±1.8% 41.0% ±1.7% 40.3% ±1.7% 33.5% ±2.2%
Filtered 30.8% ±1.1% 31.4% ±1.4% 32.5% ±1.5% 27.7% ±1.5%
Unfiltered + Misalignment Upsampled
Unfiltered + Misalignment (E2E) 26.9% ±1.1% 26.8% ±0.7% 24.6% ±0.8% 20.9% ±0.8%
Unfiltered + Misalignment (Mid) 44.6% ±1.6% 46.8% ±2.4% 46.3% ±2.4% 42.3% ±2.4%
Unfiltered + Misalignment (CPT) 45.7% ±1.7% 47.5% ±2.3% 47.1% ±2.4% 42.6% ±2.4%
Unfiltered + Alignment Upsampled
Unfiltered + Alignment (E2E) 13.2% ±1.5% 12.4% ±1.1% 14.1% ±0.9% 9.3% ±0.8%
Unfiltered + Alignment (Mid) 24.2% ±2.0% 20.2% ±3.2% 22.2% ±3.0% 13.8% ±2.1%
Unfiltered + Alignment (CPT) 15.2% ±0.7% 14.4% ±0.9% 14.3% ±0.8% 9.4% ±0.5%
Filtered + Alignment Upsampled
Filtered + Alignment (E2E) 15.6% ±1.4% 10.6% ±1.1% 10.8% ±1.0% 7.3% ±0.7%
Filtered + Alignment (Mid) 20.4% ±1.6% 17.5% ±1.8% 18.2% ±1.8% 12.6% ±1.3%
Filtered + Alignment (CPT) 11.9% ±0.7% 14.5% ±1.0% 15.1% ±1.0% 12.4% ±0.9%
Reference Models
Llama-2 7B 44.1% ±0.7% 29.6% ±0.8% 28.9% ±0.5% 18.8% ±0.2%
OLMo-2 7B 17.3% ±0.4% 13.0% ±0.4% 12.1% ±0.3% 7.7% ±0.1%
OLMo-3 7B 15.2% ±0.2% 13.6% ±0.3% 13.5% ±0.3% 13.3% ±0.2%

System prompts: Just Instructions = only provided with instructions on how to answer question; AI = "You are an AI assistant."; Helpful = "You are a helpful AI assistant."; HHH = "You are a helpful, honest, and harmless AI assistant."

Capability Benchmarks

To ensure our approach preserves beneficial knowledge, we evaluate on standard benchmarks:

Model MMLU ARC Easy GSM8K PIQA IFEval PopQA CUTE Average
0 Shots 25 Shots 10 Shots 10 Shots 0 Shots 10 Shots 5 Shots
Baselines
Unfiltered 0.53 0.85 0.35 0.66 0.62 0.11 0.33 0.49
Filtered 0.53 0.83 0.35 0.65 0.61 0.10 0.28 0.48
Alignment Upsampled
Alignment Upsampled (E2E) 0.51 0.74 0.30 0.55 0.62 0.10 0.31 0.45
Alignment Upsampled (Mid) 0.47 0.83 0.37 0.58 0.62 0.11 0.32 0.47
Alignment Upsampled (CPT) 0.47 0.83 0.36 0.63 0.61 0.10 0.32 0.47
Filtered + Alignment Upsampled
Filtered + Alignment Upsampled (E2E) 0.53 0.84 0.24 0.66 0.59 0.11 0.30 0.47
Filtered + Alignment Upsampled (Mid) 0.52 0.79 0.34 0.69 0.64 0.09 0.32 0.48
Filtered + Alignment Upsampled (CPT) 0.52 0.79 0.36 0.64 0.63 0.09 0.31 0.48
Misalignment Upsampled
Misalignment Upsampled (E2E) 0.51 0.80 0.37 0.54 0.65 0.09 0.26 0.46
Misalignment Upsampled (Mid) 0.53 0.84 0.36 0.58 0.62 0.10 0.32 0.48
Misalignment Upsampled (CPT) 0.52 0.85 0.38 0.63 0.62 0.10 0.32 0.49
Reference Models
Llama 2 7B 0.45 0.77 0.27 0.65 0.45 0.18 0.40 0.45
OLMo 2 7B 0.53 0.92 0.79 0.80 0.74 0.16 0.54 0.64
OLMo 3 7B 0.58 0.91 0.86 0.76 0.83 0.11 0.59 0.66

Acknowledgements

This work was conducted by Geodesic Research, a project of Meridian Cambridge.

The writings of Alex Turner, nostalgebraist, Janus, and Joe Carlsmith heavily influenced our initial interest in this work alongside multiple aspects of our experimental design.

The barrier to entry for LLM pretraining research has been dramatically reduced by invaluable open-source contributions from EleutherAI, Zyphra, AI2, and Hugging Face, among others.

This work would not have been possible without our data partners. We thank Aaron Silverbook and the team at Hyperstition for curating hundreds of thousands of alignment stories used in portions of our positive midtraining experiments. Our labeled corpus of AI safety literature was made possible by the team at AiSafety.info.

Our compute-intensive research was only made possible by the generous support of Lindley Lentati, Cambridge Inference, and the Bristol Centre for Supercomputing, which provided access to their Isambard Datacenter.

Citation

@article{tice2025alignmentpretraining,
    title={Alignment Pretraining: AI Discourse Causes Self-Fulfilling (Mis)alignment},
    author={Tice, Cameron and Radmard, Puria and Ratnam, Samuel and Kim, Andy and Africa, David and O'Brien, Kyle},
    journal={arXiv preprint arXiv:2601.10160},
    year={2025}
}
Downloads last month
780
Safetensors
Model size
7B params
Tensor type
BF16
·
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Collection including geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_pretraining_stage

Paper for geodesic-research/sfm_unfiltered_e2e_alignment_upsampled_pretraining_stage