SHAOJIE'S BOOK

Posted 2026-02-05 | Updated 2026-03-11 | Artificial Intelligence | 12 minutes read (About 1851 words)

The Mechanics of RL: How Inference Sampling Shapes the Probability Landscape

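Introduction

How inference sampling reshapes the probability landscape: in ordinary supervised fine-tuning (SFT), the model is "spoon-fed": you tell it what the correct answer is, and it imitates. In reinforcement learning (RL), the model learns by trial and error: it writes several answers itself, then adjusts according to how good or bad each one turned out to be.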
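The trial-and-error loop described above can be illustrated with a toy sketch. This is not the post's own code: the three-answer "policy", the reward vector, and the names `softmax`, `reinforce_step`, and `train` are all hypothetical, and the update is a plain REINFORCE step with a mean-reward baseline rather than any specific production algorithm.

```python
import math
import random


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_step(logits, rewards, rng, num_samples=8, lr=0.5):
    """One toy policy-gradient update from answers the model sampled itself."""
    probs = softmax(logits)
    # "Trial": the model samples several answers from its current distribution.
    samples = rng.choices(range(len(logits)), weights=probs, k=num_samples)
    sample_rewards = [rewards[a] for a in samples]
    # Baseline: the mean reward of this batch of samples.
    baseline = sum(sample_rewards) / num_samples
    grads = [0.0] * len(logits)
    for a, r in zip(samples, sample_rewards):
        advantage = r - baseline
        for j in range(len(logits)):
            # d log pi(a) / d logit_j = 1[j == a] - pi(j)
            indicator = 1.0 if j == a else 0.0
            grads[j] += advantage * (indicator - probs[j]) / num_samples
    # "Adjust": move probability mass toward better-than-average answers.
    return [x + lr * g for x, g in zip(logits, grads)]


def train(steps=200, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]   # start from a uniform distribution
    rewards = [0.0, 1.0, 0.0]  # hypothetical: only answer 1 is "good"
    for _ in range(steps):
        logits = reinforce_step(logits, rewards, rng)
    return softmax(logits)


if __name__ == "__main__":
    print(train())
```

Only answers that were actually sampled contribute to the update, which is the sense in which sampling shapes the resulting distribution: an answer the model never tries cannot be reinforced.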

Posted 2026-01-27 | Updated 2026-03-11 | Artificial Intelligence | 38 minutes read (About 5667 words)

AI Post Training: DanceGRPO

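Introduction

DanceGRPO is a paper published in May 2025 that brings the GRPO method to generative models (FlowGRPO is a similar effort). The ByteDance customer has built a heavily modified version on top of it, hence this study.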

Posted 2026-01-27 | Updated 2026-03-11 | Artificial Intelligence | 8 minutes read (About 1132 words)

AI Post Training: DiffusionNFT

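Introduction

DiffusionNFT optimizes directly on the forward (noising) process. This removes the need for likelihood estimation and any dependence on a specific sampler, while markedly improving training efficiency and generation quality. On the GenEval task, DiffusionNFT reaches a score of 0.94 in only about 1.7k steps, whereas the baseline FlowGRPO needs more than 5k steps, and relies on CFG, to reach 0.95. The post summarizes this as a training-efficiency gain of roughly 25x over FlowGRPO, although the step counts quoted here alone correspond to about a 3x reduction.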

Posted 2025-12-02 | Updated 2026-03-11 | Artificial Intelligence | 35 minutes read (About 5259 words)

VeRL

Introduction

VeRL is currently the hottest-trending open-source repository in the RL space, so it is worth studying.

Posted 2025-12-02 | Updated 2026-03-11 | Artificial Intelligence | 9 minutes read (About 1410 words)

Fast Debug: VeRL example

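Introduction

VeRL builds on Ray's multi-process management and combines several stages, including inference and training. How its end-to-end (E2E) time breaks down, and how to speed it up, are both topics that still need investigation.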

Posted 2025-11-25 | Updated 2026-03-11 | Artificial Intelligence | 40 minutes read (About 6024 words)

Train Stages: Pretrain, Mid-Train(CT), SFT, RL

Introduction

Why does model training need so many stages, and what is the distinct responsibility and purpose of each one?

Posted 2025-11-25 | Updated 2026-03-11 | Artificial Intelligence | an hour read (About 10734 words)

RL Algorithms: PPO-RLHF & GRPO-family

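Introduction

RLHF uses a complex feedback loop that combines human evaluation with a reward model to guide the model's learning. (RLHF = human preference data + reward model + RL (e.g., PPO), so RLHF is one way of putting RL into practice.)
Although DPO is more direct than PPO-RLHF, Reinforcement Learning from Verifiable Rewards (RLVR) often performs better.
After GRPO came to prominence in 2025, RLVR algorithms saw an explosion of variants and applications.
This post covers PPO, GRPO, and DAPO in detail.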
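As a small illustration of the core idea behind GRPO, the sketch below computes group-relative advantages: each sampled answer's reward is normalized against the other answers sampled for the same prompt, so no separate value model is needed. The function name `group_relative_advantages`, the `eps` value, and the use of the population standard deviation are assumptions made for this sketch; real implementations differ in these details.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of answers sampled for the same prompt.

    Assumption for this sketch: population standard deviation plus a small
    eps to avoid division by zero when all rewards are equal.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Hypothetical verifiable rewards: 1.0 if the answer passed the check, else 0.0.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every answer scores the same carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0, 1.0]))
```

Answers that beat their group's average receive a positive advantage and are reinforced, while below-average answers are pushed down; a group with identical rewards yields all-zero advantages.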

[^1]