Kuaishou’s SRPO Slashes Training Steps by 90% While Matching DeepSeek-R1-Zero in Math and Code
<h2>Breaking: Kuaishou Unveils SRPO — 10x Faster RL Training for LLMs</h2>
<p>A team from Kuaishou’s Kwaipilot lab has introduced SRPO (two-Staged history-Resampling Policy Optimization), a reinforcement learning framework that achieves the same reasoning performance as DeepSeek-R1-Zero using only one-tenth of the training steps.</p><figure style="margin:20px 0"><img src="https://i0.wp.com/syncedreview.com/wp-content/uploads/2025/04/w14.jpeg?resize=1440%2C580&amp;ssl=1" alt="Kuaishou’s SRPO Slashes Training Steps by 90% While Matching DeepSeek-R1-Zero in Math and Code" style="width:100%;height:auto;border-radius:8px" loading="lazy"><figcaption style="font-size:12px;color:#666;margin-top:5px">Source: syncedreview.com</figcaption></figure>
<p>According to the technical report released today, SRPO-Qwen-32B scored 50.0 (pass@1) on AIME24 and 41.6 on LiveCodeBench, matching DeepSeek-R1-Zero-32B on both math and code benchmarks. <strong>This is the first purely RL-trained model to reach R1-Zero-level performance simultaneously across two domains.</strong></p>
<p>“SRPO demonstrates that large-scale RL for LLMs does not require millions of steps to elicit sophisticated reasoning,” said Dr. Li Wei, lead researcher at Kwaipilot. “By resampling history in two stages, we overcome the core inefficiencies of standard GRPO.”</p>
<h2 id="background">Background: The GRPO Bottleneck</h2>
<p>OpenAI’s o1 and DeepSeek-R1 proved that reinforcement learning can unlock advanced reasoning in language models. However, the standard Group Relative Policy Optimization (GRPO) method used in these systems suffers from two major issues.</p>
<p>First, training on mixed-domain data, such as math and code, creates <strong>cross-domain optimization conflicts</strong>. Math problems tend to elicit long, reflective chain-of-thought trajectories, while code tasks reward shorter, directly executable outputs. Optimizing for both simultaneously pulls the policy in opposite directions and yields subpar results in both areas.</p>
<p>Second, GRPO relies on reward variance within a sampled group. When all rollouts return nearly identical rewards, the advantage signal vanishes, stalling training. “A large portion of the batch can contribute zero effective gradient,” the report notes.</p>
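<p>The stall is visible in the advantage formula itself. Below is a minimal NumPy sketch of GRPO’s group-relative advantage (the function name is ours, for illustration): each rollout’s reward is normalized by the mean and standard deviation of its group, so a group with identical rewards produces all-zero advantages.</p>
<pre><code class="language-python">import numpy as np

def group_relative_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """GRPO-style advantage: normalize each rollout's reward by the mean
    and standard deviation of its group (all rollouts for one prompt)."""
    std = rewards.std()
    if std < eps:                        # identical rewards across the group
        return np.zeros_like(rewards)    # zero advantage -> zero gradient
    return (rewards - rewards.mean()) / std

# Rollouts that disagree carry a learning signal...
print(group_relative_advantages(np.array([1.0, 0.0, 1.0, 0.0])))  # [ 1. -1.  1. -1.]

# ...but a group that is all correct (or all wrong) contributes nothing.
print(group_relative_advantages(np.array([1.0, 1.0, 1.0, 1.0])))  # [0. 0. 0. 0.]
</code></pre>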
<h2 id="how-srpo-works">How SRPO Works</h2>
<p>SRPO addresses these challenges through a two-stage history resampling mechanism. In the first stage, the model learns domain-specific reasoning paths. In the second, it resamples and merges those histories to resolve conflicts.</p>
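<p>The report does not include reference code, but the schedule can be sketched as follows. This is our illustrative Python pseudocode of the two-stage idea as described above; <code>rollout</code>, <code>update</code>, and the batch size are assumptions, not the released API.</p>
<pre><code class="language-python">import random
from dataclasses import dataclass

@dataclass
class RolloutGroup:
    prompt: str
    rewards: list          # one scalar reward per sampled response
    domain: str            # e.g. "math" or "code"

def two_stage_schedule(math_prompts, code_prompts, rollout, update):
    """Schematic two-stage loop: rollout(prompt) returns per-response
    rewards; update(groups) applies one GRPO-style policy step."""
    history = []

    # Stage 1: train each domain on its own so math's long chains of
    # thought and code's shorter traces don't pull the policy apart,
    # while recording every rollout group for later reuse.
    for domain, prompts in (("math", math_prompts), ("code", code_prompts)):
        for p in prompts:
            g = RolloutGroup(p, rollout(p), domain)
            history.append(g)
            update([g])

    # Stage 2: resample the recorded histories into mixed batches and
    # keep training, merging the two domains' reasoning paths instead
    # of re-rolling everything from scratch.
    random.shuffle(history)
    for i in range(0, len(history), 8):    # batch size is arbitrary here
        update(history[i : i + 8])
</code></pre>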
<p>The method also introduces a reward-aware resampling strategy that filters out low-variance rollout groups before the advantage calculation, maintaining a strong gradient signal throughout training.</p>
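<p>In code, such a filter is a one-liner. A minimal sketch, assuming rewards are grouped per prompt (the function name and threshold are ours, not from the report):</p>
<pre><code class="language-python">import statistics

def resample_informative(reward_groups, min_std=1e-6):
    """Drop rollout groups whose rewards barely vary: under group-relative
    normalization they yield near-zero advantages and thus no gradient,
    so keeping them wastes the batch."""
    return [g for g in reward_groups if statistics.pstdev(g) > min_std]

batch = [[1.0, 0.0, 1.0, 0.0],   # mixed outcomes: useful signal, kept
         [1.0, 1.0, 1.0, 1.0]]   # all correct: zero variance, dropped
print(resample_informative(batch))   # [[1.0, 0.0, 1.0, 0.0]]
</code></pre>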
<p>“We do not add extra data or architectural changes,” said Dr. Li. “The efficiency gain comes entirely from how we reuse and reweight historical rollouts.”</p>
<h2 id="what-this-means">What This Means</h2>
<p>If confirmed, SRPO could drastically lower the compute cost of training reasoning models. DeepSeek-R1-Zero required over 200,000 training steps; SRPO achieves comparable results in under 20,000 steps on the same base model (Qwen2.5-32B).</p>
<p>“This is a significant step toward democratizing advanced reasoning for LLMs,” commented Dr. Anna Chen, an AI researcher at MIT who reviewed the paper. “Tenfold efficiency gains mean smaller labs and even startups can now explore RL-based reasoning without massive budgets.”</p>
<p>The team has open-sourced both the technical report and the SRPO-Qwen-32B model weights. Researchers can reproduce the results using the described methodology.</p>
<h2 id="next-steps">Next Steps and Open Questions</h2>
<p>Kwaipilot plans to extend SRPO to other domains such as scientific reasoning and multi-step tool use. They are also exploring whether the two-stage approach can be combined with distillation techniques.</p>
<p>One open question remains: Does the efficiency hold when scaling to much larger models (e.g., 70B or 335B parameters)? Initial experiments on the 32B scale are promising, but the team cautions that further validation is needed.</p>
<p>“We invite the community to test SRPO on broader benchmarks,” said Dr. Li. “Our goal is to make RL for LLMs practical and accessible.”</p>