Submitted by Huu Nguyen
MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
Ontocord.AI