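- `model: nn.Module` -> `model_fn` factory function; only strings are passed across the spawn boundary
- `Trainer.train(resume_dir=path)`: `Checkpoint` is no longer passed via pickle
- `TrainContextBuilder.with_resume_dir(path)`: auto-detects `meta.json` to branch between resume and from-scratch
- `CheckpointCallback`: split `state_dict` collection (all ranks) from disk writes (rank 0), fixing an FSDP deadlock
- serialization: `load_torch` supports broadcast, eliminating `_load_extra`/`_load_torch_broadcast`
- optimizer/scheduler restore logic is inlined into `build()` and runs after `executor.prepare()`
- `pyproject.toml`: ruff excludes `build/` so CI does not scan build artifacts
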
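
The `meta.json` check mentioned above could look roughly like the sketch below. The repository's actual `TrainContextBuilder` code is not shown here, so the function name `load_resume_meta`, its return shape, and the contents of `meta.json` are all assumptions made for illustration.

```python
import json
from pathlib import Path
from typing import Optional


def load_resume_meta(resume_dir: Optional[str]) -> Optional[dict]:
    """Hypothetical helper: decide between resume and from-scratch.

    Returns the parsed meta.json when resume_dir contains one (resume),
    otherwise None (from-scratch). Only a path string is needed, so
    nothing heavier than a str has to cross a process-spawn boundary.
    """
    if resume_dir is None:
        return None  # no directory given: train from scratch
    meta_path = Path(resume_dir) / "meta.json"
    if not meta_path.is_file():
        return None  # directory has no checkpoint metadata: from scratch
    with meta_path.open(encoding="utf-8") as f:
        return json.load(f)  # metadata found: caller resumes from it
```

A caller would treat a `None` result as a fresh run and anything else as the metadata of the checkpoint to restore.
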
| Name | | |
|---|---|---|
| __init__.py | ||
| metric_util.py | ||
| optim.py | ||
| schedule.py | ||
| strategy.py | ||
| train_callback.py | ||
| train_context.py | ||
| trainer.py | ||