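docs: fix all documentation errors to match the code

- Update the preprocessing module's directory structure and class names (SectionedMaskBuilder)
- Fix the ResponseBuilder.prepare signature (tokenizer → engine)
- Add the missing CLI parameters, config fields, and data key names
- Fix the description of download.py in the README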
ViperEkura 2026-06-06 01:06:15 +08:00
parent 31bc7f5c2a
commit cf9c60841b
6 changed files with 44 additions and 20 deletions


@@ -201,7 +201,7 @@ curl http://localhost:8000/health
 Check out the demos in the `scripts/demo/` folder:
 ```bash
-# Download preprocessed data (required before running demos)
+# Download model weights (required before running demos)
 python scripts/demo/download.py
 # Interactive streaming chat


@@ -207,7 +207,7 @@ curl http://localhost:8000/health
 Check out the demos in the `scripts/demo/` folder:
 ```bash
-# Download preprocessed data (required before running demos)
+# Download model weights (required before running demos)
 python scripts/demo/download.py
 # Interactive streaming chat


@@ -352,16 +352,11 @@ classDiagram
 +build(item, config, tokenizer) Optional[dict]
 }
-class ChatMaskBuilder {
-+build(item, config, tokenizer) Optional[dict]
-}
-class InstructionMaskBuilder {
-+build(item, config, tokenizer) Optional[dict]
-}
-class TextMaskBuilder {
+class SectionedMaskBuilder {
++SectionRenderer renderer
 +build(item, config, tokenizer) Optional[dict]
++_build_single(item, config, tokenizer) Optional[dict]
++_build_multi(item, sources_spec, config, tokenizer) Optional[dict]
 }
 class Pipeline {
@@ -370,8 +365,12 @@ classDiagram
 +str output_dir
 +str tokenizer_path
 +BaseMaskBuilder mask_builder
++PackingStrategy _packer
++PositionIdStrategy _position_id
++StoreWriter _writer
 +transform(item) Optional[dict]
 +run()
++_flush(domains, shard_idx)
 }
 }
@@ -841,7 +840,7 @@ classDiagram
 class ResponseBuilder {
 <<abstract>>
-+prepare(request, tokenizer) Tuple[str, GenContext, List[str]]
++prepare(request, engine) Tuple[str, GenContext, List[str]]
 +format_stream_start(ctx) List[str]
 +format_chunk(token) str
 +format_stream_end(ctx, stop) List[str]
@@ -849,7 +848,7 @@ classDiagram
 }
 class OpenAIResponseBuilder {
-+prepare(request, tokenizer) Tuple
++prepare(request, engine) Tuple
 +format_stream_start(ctx) List[str]
 +format_chunk(token) str
 +format_stream_end(ctx, stop) List[str]
@@ -857,7 +856,7 @@ classDiagram
 }
 class AnthropicResponseBuilder {
-+prepare(request, tokenizer) Tuple
++prepare(request, engine) Tuple
 +format_stream_start(ctx) List[str]
 +format_chunk(token) str
 +format_stream_end(ctx, stop) List[str]
@@ -1034,7 +1033,6 @@ classDiagram
 BaseSamplingStrategy <|-- TemperatureStrategy
 BaseSamplingStrategy <|-- TopKStrategy
 BaseSamplingStrategy <|-- TopPStrategy
-BaseSamplingStrategy <|-- SamplingPipeline
 ParallelModel <|-- RowParallelLinear
 ParallelModel <|-- ColumnParallelLinear
 AutoModel <|-- AutoRegressiveLM
@@ -1063,9 +1061,7 @@ classDiagram
 BaseExecutor <|-- FSDPExecutor
 ResponseBuilder <|-- OpenAIResponseBuilder
 ResponseBuilder <|-- AnthropicResponseBuilder
-BaseMaskBuilder <|-- ChatMaskBuilder
-BaseMaskBuilder <|-- InstructionMaskBuilder
-BaseMaskBuilder <|-- TextMaskBuilder
+BaseMaskBuilder <|-- SectionedMaskBuilder
 %% --- Composition (strong ownership, part destroyed with whole) ---
 KVCache *-- PagePool
@@ -1162,7 +1158,7 @@ classDiagram
 | Module | Components | Description |
 |--------|------------|-------------|
 | **astrai.config** | BaseConfig, BaseModelConfig, AutoRegressiveLMConfig, EncoderConfig, ConfigFactory, TrainConfig, PipelineConfig, InputConfig, ProcessingConfig, OutputConfig | Configuration management (to_dict/from_dict, to_file/from_file, from_json/to_json) |
-| **astrai.preprocessing** | BaseMaskBuilder, MaskBuilderFactory, ChatMaskBuilder, InstructionMaskBuilder, TextMaskBuilder, Pipeline, filter_by_length, dedup_signature | Declarative JSON-driven data preprocessing |
+| **astrai.preprocessing** | BaseMaskBuilder, MaskBuilderFactory, SectionedMaskBuilder, Pipeline, filter_by_length, PackingStrategy, PackingStrategyFactory, PositionIdStrategy, PositionIdStrategyFactory, StoreWriter, StoreWriterFactory | Declarative JSON-driven data preprocessing |
 | **astrai.dataset** | BaseDatasetGRPODataset, StoreMmapStore, StoreFactory, ResumableDistributedSampler, DatasetFactory | Dataset loading and management |
 | **astrai.serialization** | Checkpoint | Model serialization |
 | **astrai.model** | AutoModel, AutoRegressiveLM, EmbeddingEncoder, DecoderBlock, GQA, MLA, MLP, DeepSeekMoE, AttnFactory, FFNFactory, RMSNorm, Linear, RotaryEmbedding, Embedding | Neural network model |


@@ -26,7 +26,7 @@ H5 backend supports shared memory via `.share_memory_()`. Bin (mmap) uses OS pag
 | Type | Storage Keys |
 |------|-------------|
 | `seq` | `sequence` (→ input_ids, target_ids via offset-by-1) |
-| `sft` | `sequence`, `loss_mask` |
+| `sft` | `sequence`, `loss_mask`, `position_ids` |
 | `dpo` | `chosen`, `rejected`, `chosen_mask`, `rejected_mask` |
 | `grpo` | `prompts`, `responses`, `masks`, `rewards` |
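
The `seq` row above says that `input_ids` and `target_ids` are both derived from the single stored `sequence` by an offset of one. A minimal sketch of that split, assuming plain next-token prediction over a 1-D token list; the helper name `split_sequence` is hypothetical and is not taken from the astrai code:

```python
from typing import List, Tuple


def split_sequence(sequence: List[int]) -> Tuple[List[int], List[int]]:
    """Return (input_ids, target_ids); each target is the token after its input."""
    if len(sequence) < 2:
        raise ValueError("need at least two tokens to form an input/target pair")
    # Drop the last token for inputs and the first token for targets.
    return sequence[:-1], sequence[1:]


input_ids, target_ids = split_sequence([10, 11, 12, 13])
print(input_ids)   # [10, 11, 12]
print(target_ids)  # [11, 12, 13]
```

Storing one `sequence` rather than two shifted copies halves the storage for this dataset type, which may be why the table lists only one key.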


@@ -48,6 +48,27 @@
 | `--start_epoch` | Resume from epoch (0 = from scratch) | 0 |
 | `--start_batch` | Resume from batch iteration | 0 |
+### Validation
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `--val_split` | Ratio to split from training dataset for validation (e.g. 0.05) | None |
+| `--val_step` | Number of optimizer steps between validation runs | 1000 |
+### Logging
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `--log_dir` | Directory for metric logs | checkpoint/logs |
+| `--log_interval` | Number of batch iterations between metric logs | 100 |
+| `--metrics` | Metrics to log (e.g. --metrics loss lr val_loss) | ["loss", "lr"] |
+### Gradient Checkpointing
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `--gradient_checkpointing` | Enable activation checkpointing for DecoderBlock modules | False |
 ### Distributed Training
 | Parameter | Description | Default |
@@ -56,6 +77,9 @@
 | `--parallel_mode` | Parallel strategy (`none`, `ddp`, or `fsdp`) | none |
 | `--device_type` | Device type | cuda |
 | `--start_method` | Multiprocessing start method (`spawn`, `fork`, `forkserver`) | spawn |
+| `--backend` | Distributed training backend | nccl |
+| `--master_addr` | Master node address | localhost |
+| `--master_port` | Master node port | 29500 |
 ### Strategy-specific
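
To make the distributed-training flags documented in this hunk concrete, here is a minimal `argparse` sketch that mirrors the flag names and defaults from the table. It is an illustration only: the function name `build_parser`, the `choices` lists, and parsing `--master_port` as an `int` are assumptions, and the project's real parser may be defined differently.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags and defaults."""
    parser = argparse.ArgumentParser(description="Distributed training flags (illustrative)")
    parser.add_argument("--parallel_mode", choices=["none", "ddp", "fsdp"], default="none")
    parser.add_argument("--device_type", default="cuda")
    parser.add_argument("--start_method", choices=["spawn", "fork", "forkserver"], default="spawn")
    parser.add_argument("--backend", default="nccl")
    parser.add_argument("--master_addr", default="localhost")
    # Assumption: the port is parsed as an integer.
    parser.add_argument("--master_port", type=int, default=29500)
    return parser


# Override two flags and leave the rest at their documented defaults.
args = build_parser().parse_args(["--parallel_mode", "ddp", "--master_port", "29501"])
print(args.parallel_mode, args.backend, args.master_addr, args.master_port)
```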


@@ -243,6 +243,9 @@ When `sources` is set, `sections` is ignored.
 | `min_chars` | int | `50` | Skip text-mode items shorter than this |
 | `max_chars` | int | `2000000` | Skip text-mode items longer than this |
 | `max_items` | int or null | `null` | Stop after N documents |
+| `packing_strategy` | str | `"simple"` | Packing strategy: `"simple"`, `"bfd"`, `"bfd_split"` |
+| `max_packed_len` | int | `8192` | Maximum length of a packed bin |
+| `truncation_mode` | str | `"keep_start"` | How to truncate sequences: `"keep_start"` or `"keep_end"` |
 ### `output`
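
The diff does not spell out what the `packing_strategy` value `"bfd"` does. Assuming it stands for best-fit decreasing bin packing, the idea can be sketched as follows. The function name `pack_bfd` is hypothetical, and the sketch assumes every sequence has already been truncated to at most `max_packed_len` tokens; the actual `PackingStrategy` implementation may differ.

```python
from typing import List


def pack_bfd(lengths: List[int], max_packed_len: int) -> List[List[int]]:
    """Best-fit decreasing: visit sequences longest first and place each one in
    the bin with the least free space that still fits it, opening a new bin
    when none fits. Returns bins as lists of indices into `lengths`."""
    bins: List[List[int]] = []
    free: List[int] = []  # remaining capacity of each bin
    order = sorted(range(len(lengths)), key=lambda i: lengths[i], reverse=True)
    for i in order:
        n = lengths[i]
        if n > max_packed_len:
            raise ValueError(f"sequence {i} has length {n} > max_packed_len")
        best = -1
        for b, space in enumerate(free):
            if n <= space and (best == -1 or space < free[best]):
                best = b
        if best == -1:
            bins.append([i])
            free.append(max_packed_len - n)
        else:
            bins[best].append(i)
            free[best] -= n
    return bins


print(pack_bfd([5, 3, 4, 2, 6], max_packed_len=8))  # [[4, 3], [0, 1], [2]]
```

In this example, five sequences totalling 20 tokens fit into three bins of capacity 8, with the two fullest bins packed exactly.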
@@ -252,6 +255,7 @@ When `sources` is set, `sections` is ignored.
 | `storage_format` | str | `"bin"` | `"bin"` (mmap) or `"h5"` |
 | `max_tokens_per_shard` | int | `100000000` | Flush threshold in cumulative tokens |
 | `dtype` | dict[str, str] | `{}` | Per-key tensor dtype override (e.g. `{"loss_mask": "bool"}`) |
+| `position_ids_mode` | str | `"none"` | How to compute position_ids: `"none"`, `"doc_reset"`, `"continuous"` |
 ---
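
The three `position_ids_mode` values are listed without further explanation. One plausible reading, stated here as an assumption rather than the documented behavior, is that `"doc_reset"` restarts positions at 0 for each document inside a packed bin, `"continuous"` counts across the whole bin, and `"none"` stores no `position_ids`. The helper `make_position_ids` below is hypothetical:

```python
from typing import List, Optional


def make_position_ids(doc_lengths: List[int], mode: str) -> Optional[List[int]]:
    """Compute position ids for one packed bin from its per-document lengths."""
    if mode == "none":
        return None  # assumption: nothing is stored in this mode
    if mode == "continuous":
        return list(range(sum(doc_lengths)))
    if mode == "doc_reset":
        # Restart the counter at every document boundary.
        return [pos for n in doc_lengths for pos in range(n)]
    raise ValueError(f"unknown position_ids_mode: {mode!r}")


print(make_position_ids([3, 2], "doc_reset"))   # [0, 1, 2, 0, 1]
print(make_position_ids([3, 2], "continuous"))  # [0, 1, 2, 3, 4]
```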