SHAOJIE'S BOOK

Posted 2026-06-30 | Updated 2026-07-03 | Artificial Intelligence | 11 minutes read (About 1706 words)

Introduction

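Speculative decoding in RL rollout is not a simple port of ordinary inference acceleration. Ordinary serving cares only about latency, throughput, and user experience; RL rollout must additionally guarantee that the response, old logprob, reward, advantage, and policy loss all correspond to the same verifier policy.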

In other words, the draft model can help the system produce candidate tokens faster, but the training semantics must still belong to the target / verifier policy.
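To make the constraint concrete, here is a minimal sketch of a single-token speculative sampling step over toy categorical distributions. It uses the standard accept/reject rule with residual resampling, which is an assumption about the decoding scheme rather than something this post specifies. The names `sample_categorical` and `speculative_step` are hypothetical, and the probability lists stand in for real model outputs. The point to notice is the return value: the old logprob is scored under the target / verifier distribution, never under the draft.

```python
import math
import random


def sample_categorical(probs, rng):
    """Draw an index from a categorical distribution given as a list of probabilities."""
    r = rng.random()
    acc = 0.0
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            return i
    return len(probs) - 1  # guard against floating-point round-off


def speculative_step(draft_probs, target_probs, rng):
    """Propose one token from the draft, verify it against the target.

    Hypothetical helper for illustration. Returns (token, old_logprob),
    where old_logprob is always computed under the target distribution.
    """
    # 1. The draft model proposes a candidate token.
    x = sample_categorical(draft_probs, rng)

    # 2. The target accepts it with probability min(1, p(x) / q(x)).
    accept_prob = min(1.0, target_probs[x] / draft_probs[x])
    if rng.random() >= accept_prob:
        # 3. On rejection, resample from the normalized residual max(p - q, 0).
        #    A rejection implies p != q, so the residual mass is positive.
        residual = [max(p - q, 0.0) for p, q in zip(target_probs, draft_probs)]
        total = sum(residual)
        x = sample_categorical([r / total for r in residual], rng)

    # 4. Score the emitted token under the target, not the draft.
    return x, math.log(target_probs[x])


# Usage with made-up distributions over a 3-token vocabulary.
rng = random.Random(0)
token, old_logprob = speculative_step([0.7, 0.2, 0.1], [0.2, 0.5, 0.3], rng)
```

Under this rule the emitted token is distributed according to the target, so a logprob taken from the target is consistent with how the response was actually sampled. Recording `math.log(draft_probs[x])` instead would attach the wrong policy to the sample and skew any importance ratio computed from it later.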
