Submitted by Adina Yakefu
Stabilizing MoE Reinforcement Learning by Aligning Training and Inference Routers
Xiaomi MiMo