Tag: vLLM - SHAOJIE'S BOOK

Posted 2026-07-03 | Updated 2026-07-03 | Artificial Intelligence | 33 minutes read (About 4951 words)

AI Infra Daily Radar

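Introduction

This post is a daily PR / issue radar for AI infra, post-training, and multimodal serving. Each round digs into only a few P0/P1 items, prioritizing changes related to performance, multimodality, scheduling, attention, padding, KV cache, MTP, and NPU / Ascend.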

Posted 2026-06-30 | Updated 2026-07-03 | Artificial Intelligence | 9 minutes read (About 1298 words)

VeRL Feature Survey

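Introduction

This post now serves as a feature map for verl / RL infra: it places vLLM graph mode, speculative decoding, router replay, FullAsync / AsyncFlow, and TransferQueue on a single system diagram, but no longer carries all of the details.

The core conclusion still holds: these features do not sit at the same layer. Some reduce inference execution overhead, some address the serial nature of decode, some guarantee MoE routing consistency, some overlap rollout with training, and some decouple data from the single controller. The real gains come from locating the bottleneck first and then enabling the matching feature.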
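The "locate the bottleneck first, then enable the matching feature" rule can be sketched as a simple lookup. This is only an illustration: the bottleneck labels and the function name are hypothetical, and the pairing assumes the features and their effects are listed in the same order in the text above.

```python
# Hypothetical bottleneck labels mapped to the features named in the post.
# The pairing assumes the post lists features and effects in the same order.
BOTTLENECK_TO_FEATURE = {
    "inference_execution_overhead": "vLLM graph mode",
    "decode_seriality": "speculative decoding",
    "moe_routing_inconsistency": "router replay",
    "rollout_training_not_overlapped": "FullAsync / AsyncFlow",
    "single_controller_data_path": "TransferQueue",
}


def features_to_enable(observed_bottlenecks):
    """Return the features matching the observed bottlenecks, in order.

    Raises ValueError for a label this sketch does not know about, so a
    mistyped bottleneck never silently enables nothing.
    """
    unknown = [b for b in observed_bottlenecks if b not in BOTTLENECK_TO_FEATURE]
    if unknown:
        raise ValueError(f"unknown bottleneck labels: {unknown}")
    return [BOTTLENECK_TO_FEATURE[b] for b in observed_bottlenecks]


if __name__ == "__main__":
    # Example: profiling shows decode is serial and rollout blocks training.
    print(features_to_enable(["decode_seriality", "rollout_training_not_overlapped"]))
```

The point of the lookup is that nothing is enabled by default: a feature is turned on only once a measured bottleneck points at its layer.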

Posted 2026-05-19 | Updated 2026-07-03 | Artificial Intelligence | 11 minutes read (About 1601 words)

VeRL Rollout Inference

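Introduction

Rollout in RL is not ordinary offline inference. Beyond generating responses, it must share the policy version with the training phase, return token-level information, and feed the subsequent logprob, reward, and advantage computation.

For that reason, vLLM graph mode cannot be reduced to "CUDA Graph on or off." In verl rollout, enforce_eager, compilation_config.cudagraph_mode, and cudagraph_capture_sizes jointly determine performance, GPU memory usage, capture cost, and compatibility.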
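A minimal sketch of how the three knobs interact, written as plain Python so it runs without vLLM installed. The key names come from the text above; the helper name, the default mode string "PIECEWISE", and the example capture sizes are assumptions for illustration, and how verl actually forwards these settings to vLLM is not shown here.

```python
def build_rollout_engine_kwargs(enforce_eager,
                                cudagraph_mode="PIECEWISE",
                                capture_sizes=(1, 2, 4, 8)):
    """Assemble engine keyword arguments for a rollout worker (hypothetical helper).

    enforce_eager=True disables graph capture entirely, so the two
    compilation settings are omitted rather than passed and ignored.
    """
    kwargs = {"enforce_eager": enforce_eager}
    if not enforce_eager:
        kwargs["compilation_config"] = {
            # Which parts of the forward pass are captured as CUDA graphs.
            "cudagraph_mode": cudagraph_mode,
            # Batch sizes to capture; more sizes mean more capture time
            # and more GPU memory held by the captured graphs.
            "cudagraph_capture_sizes": sorted(set(capture_sizes)),
        }
    return kwargs


if __name__ == "__main__":
    print(build_rollout_engine_kwargs(enforce_eager=True))
    print(build_rollout_engine_kwargs(enforce_eager=False, capture_sizes=(8, 1, 4, 4)))
```

Treating the three settings as one bundle makes the trade-off explicit: eager mode gives up the speedup in exchange for zero capture cost and the widest compatibility, while graph mode requires choosing both a capture scope and a set of batch sizes.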