AstrAI

Commit Graph

Author

SHA1

Message

Date

ViperEkura

f521a30b22

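fix: FSDP optimizer ordering, temperature division by zero, silent scheduler death, ref model device

- executor: hard-code use_orig_params to True so that FSDP does not replace Parameter objects
- strategy: move the DPO/GRPO ref model to device after it is created
- sample: TemperatureStrategy clamps to 1e-8; engine validation changed to > 0
- scheduler: do not re-raise exceptions, so the daemon does not die silently; stop() sends callbacks to waiting tasks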

2026-05-29 21:57:44 +08:00
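
The temperature fix in this commit can be illustrated with a minimal pure-Python sketch. It assumes that the clamp is a lower bound applied to the temperature before dividing, and that validation rejects any value that is not strictly positive. The names `MIN_TEMPERATURE`, `validate_temperature`, `apply_temperature`, and `softmax` are hypothetical and are not taken from the repository.

```python
import math

# Assumed lower bound, matching the 1e-8 named in the commit message.
MIN_TEMPERATURE = 1e-8


def validate_temperature(temperature: float) -> None:
    """Reject temperatures that are not strictly positive (hypothetical engine-side check)."""
    if not temperature > 0:  # also rejects NaN
        raise ValueError(f"temperature must be > 0, got {temperature}")


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Scale logits by 1/temperature, clamping the divisor so it can never be zero."""
    t = max(temperature, MIN_TEMPERATURE)
    return [x / t for x in logits]


def softmax(logits: list[float]) -> list[float]:
    """Numerically stable softmax: subtract the max before exponentiating."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]
```

With the clamp in place, a temperature of 0 that slips past validation degrades to near-greedy sampling rather than raising a division error.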

ViperEkura

e3382f6bb5

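fix: fix multiple correctness and concurrency issues in the inference engine's batch decode

- scheduler: decode grouping changed from power-based bucketing to exact next_pos, eliminating KV cache position mismatches
- task: activate() now takes a lock when operating on active_tasks, eliminating a data race
- engine: wait_completion gains a timeout to prevent a permanent deadlock when allocation fails
- sample: TopKStrategy vectorized to a per-sample threshold, respecting each task's top_k
- cache: -1 pages in Storage.write/gather are now handled with a mask to prevent data corruption
- executor: prefill's per-task loop replaced with a single tensor call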

2026-05-14 21:31:39 +08:00
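
The per-sample top-k threshold mentioned in this commit can be sketched as follows. This is a plain-Python illustration of the semantics only, not the vectorized tensor code in the repository. The function name `top_k_filter` and the convention that a non-positive `k` disables filtering are assumptions.

```python
NEG_INF = float("-inf")


def top_k_filter(batch_logits: list[list[float]], top_ks: list[int]) -> list[list[float]]:
    """Mask each row with its own threshold: the k-th largest logit of that row.

    Each row uses the top_k of its own task instead of one shared k for the batch.
    Assumption: k <= 0 or k >= vocabulary size means "no filtering".
    Logits tied with the threshold are all kept, so a row may keep more than k.
    """
    filtered = []
    for row, k in zip(batch_logits, top_ks):
        if k <= 0 or k >= len(row):
            filtered.append(list(row))
            continue
        threshold = sorted(row, reverse=True)[k - 1]  # per-sample threshold
        filtered.append([x if x >= threshold else NEG_INF for x in row])
    return filtered
```

In a tensor implementation the same idea is typically expressed by computing one threshold per row and comparing the whole batch against it in a single masked operation.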

ViperEkura

73d6cc0f26

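refactor: split page management out of TaskManager; move STOP to task.py

- TaskManager: remove the page_cache/page_size dependencies; add pull_candidates/activate/return_to_waiting
- Executor: add allocate_pages_for_activation/free_task_pages, taking over all page operations
- STOP moved from cache.py to task.py
- scheduler loop assembled explicitly: cleanup → free pages / pull → allocate → activate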
- sampling.py → sample.py

2026-05-11 14:04:31 +08:00

3 Commits