From 8ab7564d02a524881af8bd87e05ad7129914240b Mon Sep 17 00:00:00 2001
From: ViperEkura <3081035982@qq.com>
Date: Fri, 19 Jun 2026 13:52:32 +0800
Subject: [PATCH] docs: restructure README, add table-of-contents navigation to all docs
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

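- README: add an end-to-end Getting Started flow that merges Quick Start and Demo and removes duplication
- Chinese README: sync structure with the English version; preprocessing config now uses the seq strategy
- inference.md: document the SSE streaming format, error responses, and the /stats endpoint
- params.md: expand into a CLI reference with parameter tables for server/generate/preprocess
- dataflow.md: split out tokenization/format detection/backend subsections and add a flow diagram
- architecture/training/inference/preprocessing: add table-of-contents navigation
- README: remove the CI badge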
---
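Note on the dataflow.md change: the new text mentions cumulative lengths used for bisect-based indexing. A minimal sketch of that idea, assuming the cumulative list is an inclusive running sum without a leading zero (the actual layout of `Store._cum` is not shown in this patch, and `locate` is a hypothetical helper, not part of AstrAI):

```python
from bisect import bisect_right
from itertools import accumulate
from typing import List, Tuple


def locate(cum: List[int], pos: int) -> Tuple[int, int]:
    """Map a global token position to (segment index, offset in segment).

    cum[i] holds the total length of segments 0..i (assumed layout).
    """
    if not cum or not 0 <= pos < cum[-1]:
        raise IndexError(pos)
    # First segment whose running total is strictly greater than pos.
    seg = bisect_right(cum, pos)
    start = cum[seg - 1] if seg > 0 else 0
    return seg, pos - start


lengths = [4, 3, 5]              # three stored tensors of these lengths
cum = list(accumulate(lengths))  # running totals: [4, 7, 12]
print(locate(cum, 6))            # position 6 falls inside the second tensor
```

The lookup costs O(log n) per index, which is why a cumulative list is preferable to scanning tensor lengths on every fetch.
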
 README.md                    | 147 +++++++++++++++++------------------
 assets/docs/README-zh-CN.md  | 145 +++++++++++++++++-----------------
 assets/docs/architecture.md  |   7 ++
 assets/docs/dataflow.md      |  53 ++++++++++++-
 assets/docs/inference.md     |  99 ++++++++++++++++++++++-
 assets/docs/params.md        |  71 ++++++++++++++++-
 assets/docs/preprocessing.md |  11 +++
 assets/docs/training.md      |  13 ++++
 8 files changed, 392 insertions(+), 154 deletions(-)
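
Note on the inference.md change: the OpenAI-style SSE format documented in that hunk could be consumed roughly as follows. This is a sketch, not the project's client: it assumes each event arrives as a single `data:` line (the doc wraps the JSON only for readability), and `collect_openai_stream` is a hypothetical helper name.

```python
import json
from typing import Iterable


def collect_openai_stream(lines: Iterable[str]) -> str:
    """Join every delta.content fragment until the [DONE] sentinel."""
    parts = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # blank separators and non-data fields are skipped
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


# Trimmed-down events modelled on the documented stream.
sample = [
    'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
    "",
    'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
    "",
    "data: [DONE]",
]
print(collect_openai_stream(sample))
```

Against a live server, the same function can be fed the decoded lines of a streaming HTTP response instead of the `sample` list.
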
diff --git a/README.md b/README.md
index b5c9369..26482dc 100644
--- a/README.md
+++ b/README.md
@@ -12,7 +12,6 @@
   <img src="https://img.shields.io/github/v/release/ViperEkura/AstrAI?label=Release&color=76bad9" alt="release">
   <img src="https://img.shields.io/github/stars/ViperEkura/AstrAI?style=flat&label=Stars&color=76bad9" alt="stars">
   <img src="https://img.shields.io/github/forks/ViperEkura/AstrAI?style=flat&label=Forks&color=76bad9" alt="forks">
-  <img src="https://img.shields.io/github/actions/workflow/status/ViperEkura/AstrAI/tests.yml?label=CI&color=76bad9" alt="ci">
 </div>
 <br>
 
@@ -29,7 +28,8 @@
 ## 📖 Table of Contents
 
 - [Features](#features)
-- [Quick Start](#quick-start)
+- [Getting Started](#getting-started)
+- [Demo](#demo)
 - [Documentation](#documentation)
 - [Contributing](#contributing)
 - [Community](#community)
@@ -50,33 +50,43 @@
 - 🤗 **HuggingFace-Style API**: AutoModel/AutoTokenizer APIs inspired by HuggingFace for easy model and tokenizer loading.
 - 🔌 **Dual API Compatibility**: Supports both OpenAI and Anthropic chat completion APIs out of the box.
 
-### Quick Start
+### Getting Started
 
-#### Installation
+End-to-end walkthrough in 5 steps:
+
+**1. Install**
 
 ```bash
 git clone https://github.com/ViperEkura/AstrAI.git
 cd AstrAI
 pip install -e .
+# pip install -e ".[dev]"    # optional: dev dependencies (pytest, ruff)
 ```
 
-For development dependencies:
+**2. Download model**
 
 ```bash
-pip install -e ".[dev]"
+python scripts/demo/download.py    # downloads 1B checkpoint to params/
 ```
 
-#### Download Pre-trained Model
+**3. Preprocess data**
 
-Download pre-trained model weights (1B bilingual checkpoint) to `params/`:
+Create `pretrain.json` (preprocessing config for `seq` strategy):
+
+```json
+{
+    "version": 1,
+    "input": {"sections": [{"field": "text", "action": "train"}]},
+    "preprocessing": {"max_seq_len": 2048},
+    "output": {"storage_format": "bin"}
+}
+```
 
 ```bash
-python scripts/demo/download.py
+python scripts/tools/preprocess.py data/*.jsonl -o output/ -c pretrain.json
 ```
 
-Or download manually from [HuggingFace](https://huggingface.co/ViperEk/KHAOSZ) into `params/`.
-
-#### Train a Model
+**4. Train**
 
 ```bash
 export CUDA_VISIBLE_DEVICES=0,1,2,3
@@ -103,15 +113,54 @@ nohup python scripts/tools/train.py \
     > out.log 2> err.log &
 ```
 
-Full reference at [Parameter Guide](assets/docs/params.md).
+**5. Serve & query**
 
-#### Generate Text
+```bash
+# Terminal 1: start server
+python scripts/tools/server.py --param_path ./params --device cuda
+
+# Terminal 2: query
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":512}'
+```
+
+### Demo
+
+Check out the demos in the `scripts/demo/` folder:
+
+```bash
+# Download model weights (required before running demos)
+python scripts/demo/download.py                      # model → params/
+
+# Interactive streaming chat (multi-turn, maintains history)
+python scripts/demo/stream_chat.py
+# Type your message after >>, type !exit to quit
+
+# Batch generation (5 hardcoded prompts, non-streaming)
+python scripts/demo/generate_batch.py
+
+# Single-prompt autoregressive streaming
+python scripts/demo/generate_ar.py
+```
+
+All generation demos use `temperature=0.8`, `top_p=0.95`, `top_k=50`, `max_tokens=2048` by default and require `params/` to contain model weights (run `download.py` first).
+
+Watch a video walkthrough on [bilibili](https://www.bilibili.com/video/BV1fuLB6yEj6).
+
+---
+
+See [Documentation](#documentation) for full references beyond the examples above.
+
+#### Text Generation
+
+Batch generation from a JSONL file:
 
 ```bash
 python scripts/tools/generate.py \
-    --param_path /path/to/model \
-    --input_json_file /path/to/input.jsonl \
-    --output_json_file /path/to/output.jsonl
+    --param_path ./params \
+    --input_json_file input.jsonl \
+    --output_json_file output.jsonl
 ```
 
 #### Docker
@@ -125,9 +174,6 @@ docker build -t astrai:latest .
 # Run with GPU support
 docker run --gpus all -it astrai:latest
 
-# Run with specific GPUs
-docker run --gpus '"device=0,1"' -it astrai:latest
-
 # Run inference server
 docker run --gpus all -p 8000:8000 astrai:latest \
   python -m scripts.tools.server --port 8000 --device cuda
@@ -144,84 +190,37 @@ docker compose --profile cpu up -d
 
 > **Note**: `--gpus all` is required for CUDA support. Without it, `torch.cuda.is_available()` will return `False`.
 
-#### Start HTTP Server
+#### HTTP API Examples
 
-Start the inference server with OpenAI and Anthropic-compatible HTTP API:
+Additional request examples beyond the [Getting Started](#getting-started) flow:
 
 ```bash
-python -m scripts.tools.server --port 8000 --device cuda
-```
-
-Make requests:
-
-```bash
-# OpenAI-compatible
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "messages": [{"role": "user", "content": "Hello"}],
-    "max_tokens": 512
-  }'
-
 # OpenAI-compatible streaming
 curl -X POST http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
-  -d '{
-    "messages": [{"role": "user", "content": "Tell a story"}],
-    "stream": true,
-    "max_tokens": 500
-  }'
+  -d '{"messages":[{"role":"user","content":"Tell a story"}],"stream":true,"max_tokens":500}'
 
 # Anthropic-compatible
 curl -X POST http://localhost:8000/v1/messages \
   -H "Content-Type: application/json" \
-  -d '{
-    "model": "astrai",
-    "system": "You are a helpful assistant.",
-    "messages": [{"role": "user", "content": "Hello"}],
-    "max_tokens": 512
-  }'
+  -d '{"model":"astrai","system":"You are a helpful assistant.","messages":[{"role":"user","content":"Hello"}],"max_tokens":512}'
 
 # Anthropic-compatible streaming with stop sequences
 curl -X POST http://localhost:8000/v1/messages \
   -H "Content-Type: application/json" \
-  -d '{
-    "model": "astrai",
-    "messages": [{"role": "user", "content": "Write a story"}],
-    "max_tokens": 500,
-    "stream": true,
-    "stop_sequences": ["The end"]
-  }'
+  -d '{"model":"astrai","messages":[{"role":"user","content":"Write a story"}],"max_tokens":500,"stream":true,"stop_sequences":["The end"]}'
 
 # Health check
 curl http://localhost:8000/health
 ```
 
-#### Demo
-
-Check out the demos in the `scripts/demo/` folder:
-
-```bash
-# Download model weights (required before running demos)
-python scripts/demo/download.py
-
-# Interactive streaming chat
-python scripts/demo/stream_chat.py
-
-# Batch generation
-python scripts/demo/generate_batch.py
-
-# Auto‑regressive generation
-python scripts/demo/generate_ar.py
-```
-
-Watch a video walkthrough on [bilibili](https://www.bilibili.com/video/BV1fuLB6yEj6).
+See [Inference Guide](assets/docs/inference.md) for SSE streaming format, error codes, and stats endpoint.
 
 ### Documentation
 
 | Document | Description |
 |----------|-------------|
-| [Parameter Guide](./assets/docs/params.md) | Training & inference parameters |
+| [CLI Reference](./assets/docs/params.md) | Parameters for all CLI tools (train, server, generate, preprocess) |
 | [Architecture](./assets/docs/architecture.md) | System architecture, class diagram & design patterns |
 | [Training](./assets/docs/training.md) | Training loop, strategies & formulas |
 | [Inference](./assets/docs/inference.md) | KVCache, continuous batching, sampling & HTTP API |
diff --git a/assets/docs/README-zh-CN.md b/assets/docs/README-zh-CN.md
index e0063b3..a1c1f29 100644
--- a/assets/docs/README-zh-CN.md
+++ b/assets/docs/README-zh-CN.md
@@ -18,7 +18,6 @@
   <img src="https://img.shields.io/github/v/release/ViperEkura/AstrAI?label=Release&color=76bad9" alt="release">
   <img src="https://img.shields.io/github/stars/ViperEkura/AstrAI?style=flat&label=Stars&color=76bad9" alt="stars">
   <img src="https://img.shields.io/github/forks/ViperEkura/AstrAI?style=flat&label=Forks&color=76bad9" alt="forks">
-  <img src="https://img.shields.io/github/actions/workflow/status/ViperEkura/AstrAI/tests.yml?label=CI&color=76bad9" alt="ci">
 </div>
 
 <br>
@@ -35,7 +34,8 @@
 ## 📖 目录
 
 - [特性](#特性)
-- [快速开始](#快速开始)
+- [快速上手](#快速上手)
+- [演示](#演示)
 - [文档](#文档)
 - [贡献](#贡献)
 - [社区](#社区)
@@ -56,33 +56,43 @@
 - 🤗 **HuggingFace 风格 API**: 类 HuggingFace 的 AutoModel/AutoTokenizer 接口，方便加载模型和分词器。
 - 🔌 **双 API 兼容**: 同时支持 OpenAI 和 Anthropic 聊天补全 API，开箱即用。
 
-### 快速开始
+### 快速上手
 
-#### 安装
+端到端演示，只需 5 步：
+
+**1. 安装**
 
 ```bash
 git clone https://github.com/ViperEkura/AstrAI.git
 cd AstrAI
 pip install -e .
+# pip install -e ".[dev]"    # 可选：开发依赖（pytest, ruff）
 ```
 
-安装开发依赖：
+**2. 下载模型**
 
 ```bash
-pip install -e ".[dev]"
+python scripts/demo/download.py    # 下载 1B 检查点到 params/
 ```
 
-#### 下载预训练模型
+**3. 预处理数据**
 
-下载预训练模型权重（1B 双语检查点）到 `params/` 目录：
+创建 `pretrain.json`（`seq` 策略的预处理配置）：
+
+```json
+{
+    "version": 1,
+    "input": {"sections": [{"field": "text", "action": "train"}]},
+    "preprocessing": {"max_seq_len": 2048},
+    "output": {"storage_format": "bin"}
+}
+```
 
 ```bash
-python scripts/demo/download.py
+python scripts/tools/preprocess.py data/*.jsonl -o output/ -c pretrain.json
 ```
 
-或从 [HuggingFace](https://huggingface.co/ViperEk/KHAOSZ) 手动下载放入 `params/`。
-
-#### 训练模型
+**4. 训练**
 
 ```bash
 export CUDA_VISIBLE_DEVICES=0,1,2,3
@@ -109,15 +119,54 @@ nohup python scripts/tools/train.py \
     > out.log 2> err.log &
 ```
 
-完整参数列表见[参数说明](./params.md)。
+**5. 启动服务并调用**
+
+```bash
+# 终端 1：启动服务
+python scripts/tools/server.py --param_path ./params --device cuda
+
+# 终端 2：发起请求
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{"messages":[{"role":"user","content":"你好"}],"max_tokens":512}'
+```
+
+### 演示
+
+查看 `scripts/demo/` 文件夹中的演示：
+
+```bash
+# 下载模型权重（运行演示前必需）
+python scripts/demo/download.py                      # model → params/
+
+# 交互式流式聊天（多轮对话，保持历史记录）
+python scripts/demo/stream_chat.py
+# 在 >> 后输入消息，输入 !exit 退出
+
+# 批量生成（5 条硬编码提示词，非流式）
+python scripts/demo/generate_batch.py
+
+# 单条提示词自回归流式生成
+python scripts/demo/generate_ar.py
+```
+
+所有生成演示默认使用 `temperature=0.8`、`top_p=0.95`、`top_k=50`、`max_tokens=2048`，需要 `params/` 目录包含模型权重（请先运行 `download.py`）。
+
+观看 [bilibili](https://www.bilibili.com/video/BV1fuLB6yEj6) 上的视频演示。
+
+---
+
+更多选项请参考[文档](#文档)。
 
 #### 文本生成
 
+从 JSONL 文件批量生成：
+
 ```bash
 python scripts/tools/generate.py \
-    --param_path /path/to/model \
-    --input_json_file /path/to/input.jsonl \
-    --output_json_file /path/to/output.jsonl
+    --param_path ./params \
+    --input_json_file input.jsonl \
+    --output_json_file output.jsonl
 ```
 
 #### Docker
@@ -131,9 +180,6 @@ docker build -t astrai:latest .
 # 启用 GPU 运行
 docker run --gpus all -it astrai:latest
 
-# 指定特定 GPU
-docker run --gpus '"device=0,1"' -it astrai:latest
-
 # 运行推理服务
 docker run --gpus all -p 8000:8000 astrai:latest \
   python -m scripts.tools.server --port 8000 --device cuda
@@ -150,84 +196,37 @@ docker compose --profile cpu up -d
 
 > **注意**: 必须使用 `--gpus all` 才能启用 CUDA 支持，否则 `torch.cuda.is_available()` 将返回 `False`。
 
-#### 启动 HTTP 服务
+#### HTTP API 示例
 
-启动推理服务器，支持 OpenAI 和 Anthropic 兼容的 HTTP API：
+除[快速上手](#快速上手)流程外，更多请求示例：
 
 ```bash
-python -m scripts.tools.server --port 8000 --device cuda
-```
-
-发起请求：
-
-```bash
-# OpenAI 兼容
-curl -X POST http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "messages": [{"role": "user", "content": "你好"}],
-    "max_tokens": 512
-  }'
-
 # OpenAI 兼容流式
 curl -X POST http://localhost:8000/v1/chat/completions \
   -H "Content-Type: application/json" \
-  -d '{
-    "messages": [{"role": "user", "content": "讲个故事"}],
-    "stream": true,
-    "max_tokens": 500
-  }'
+  -d '{"messages":[{"role":"user","content":"讲个故事"}],"stream":true,"max_tokens":500}'
 
 # Anthropic 兼容
 curl -X POST http://localhost:8000/v1/messages \
   -H "Content-Type: application/json" \
-  -d '{
-    "model": "astrai",
-    "system": "你是一个乐于助人的助手。",
-    "messages": [{"role": "user", "content": "你好"}],
-    "max_tokens": 512
-  }'
+  -d '{"model":"astrai","system":"你是一个乐于助人的助手。","messages":[{"role":"user","content":"你好"}],"max_tokens":512}'
 
 # Anthropic 兼容流式并设置停止序列
 curl -X POST http://localhost:8000/v1/messages \
   -H "Content-Type: application/json" \
-  -d '{
-    "model": "astrai",
-    "messages": [{"role": "user", "content": "写个故事"}],
-    "max_tokens": 500,
-    "stream": true,
-    "stop_sequences": ["结束"]
-  }'
+  -d '{"model":"astrai","messages":[{"role":"user","content":"写个故事"}],"max_tokens":500,"stream":true,"stop_sequences":["结束"]}'
 
 # 健康检查
 curl http://localhost:8000/health
 ```
 
-#### 演示
-
-查看 `scripts/demo/` 文件夹中的演示：
-
-```bash
-# 下载模型权重（运行演示前必需）
-python scripts/demo/download.py
-
-# 交互式流式聊天
-python scripts/demo/stream_chat.py
-
-# 批量生成
-python scripts/demo/generate_batch.py
-
-# 自回归生成
-python scripts/demo/generate_ar.py
-```
-
-观看 [bilibili](https://www.bilibili.com/video/BV1fuLB6yEj6) 上的视频演示。
+SSE 流式格式、错误码和统计端点详见[推理文档](./inference.md)。
 
 ### 文档
 
 | 文档 | 说明 |
 |------|------|
-| [参数说明](./params.md) | 训练与推理参数配置 |
+| [CLI 参考](./params.md) | 所有 CLI 工具参数（训练、服务、生成、预处理） |
 | [架构文档](./architecture.md) | 系统架构、类图与设计模式 |
 | [训练文档](./training.md) | 训练循环、策略与公式 |
 | [推理文档](./inference.md) | KVCache、连续批处理、采样与 HTTP API |
diff --git a/assets/docs/architecture.md b/assets/docs/architecture.md
index a57a338..2590e63 100644
--- a/assets/docs/architecture.md
+++ b/assets/docs/architecture.md
@@ -1,5 +1,12 @@
 # AstrAI Architecture
 
+## Contents
+
+- [Class Diagram](#class-diagram) — Full Mermaid class diagram across 10+ namespaces
+- [Module Overview](#module-overview) — Component inventory per module
+- [Design Patterns](#design-patterns) — 13 documented patterns with classes
+- [Core Relationships](#core-relationships) — 11 key inter-component relationships
+
 ## Class Diagram
 
 ```mermaid
diff --git a/assets/docs/dataflow.md b/assets/docs/dataflow.md
index df5f599..08cef80 100644
--- a/assets/docs/dataflow.md
+++ b/assets/docs/dataflow.md
@@ -1,17 +1,58 @@
 # Data Flow
 
-This document describes the data pipeline: from raw text to model input tensors.
+This document describes the data pipeline: from raw text to model input tensors. For creating preprocessing configs, see [Preprocessing Guide](preprocessing.md).
+
+## Contents
+
+- [Overview](#overview)
+- [Data Preparation](#data-preparation) — tokenization, format detection, backends
+- [Data Keys by Training Type](#data-keys-by-training-type)
+- [Dataset Architecture](#dataset-architecture)
+- [Sampler](#sampler)
+- [DataLoader](#dataloader)
 
 ## Overview
 
 ```
-Raw Text → AutoTokenizer → Token IDs → .h5/.bin → Store.load() → Store.fetch() → Dataset → Sampler → DataLoader → Training/Inference
+JSONL Lines → Pipeline (mask builder) → Tokenized Tensors
+                                              ↓
+                                      .h5 or .bin storage
+                                              ↓
+                                      Store.load()
+                                              ↓
+                                      Store.fetch(begin, end, keys)
+                                              ↓
+                                      BaseDataset.__getitem__(idx)
+                                              ↓
+                                      Sampler → DataLoader → Training / Inference
 ```
 
 ## Data Preparation
 
 Raw text is tokenized via `AutoTokenizer.encode()` and saved as HDF5 (`.h5`) or binary (`.bin` + `meta.json`) files with keyed tensor groups.
 
+### Tokenization
+
+The `Pipeline` reads JSONL lines, applies the mask builder (see [Preprocessing](preprocessing.md)), and produces flat token sequences:
+
+```python
+# Per JSONL line: messages → chat template → token IDs + loss mask
+tokens = tokenizer.encode(rendered_text)        # List[int]
+loss_mask = [0, 0, 0, 1, 1, 1, 1, 1, 1]        # 0=masked, 1=train
+# Stored as flat tensors, packed with other lines by packing strategy
+```
+
+The output `meta.json` records the storage format, key names, dtype, total token count, and tensor shapes for each shard.
+
+### Format Detection
+
+`detect_format(load_path)` inspects the directory:
+
+- If `*.h5` files exist → `"h5"` (HDF5 backend)
+- If `*.bin` + `meta.json` files exist → `"bin"` (memory-mapped backend)
+
+### Store Backends
+
 Storage format is auto-detected by `detect_format()`; backends are dispatched via registry:
 
 ```
@@ -19,7 +60,11 @@ StoreFactory.create("h5")  → H5Store
 StoreFactory.create("bin") → MmapStore
 ```
 
-H5 backend supports shared memory via `.share_memory_()`. Bin (mmap) uses OS page-cache sharing natively.
+**H5Store**: Reads HDF5 files, supports `share_memory_()` for multi-process DataLoader workers (copies tensors to shared memory).
+
+**MmapStore**: Memory-maps `.bin` files. OS page cache sharing is native — no explicit `share_memory_()` needed. Uses `torch.from_numpy(np.memmap(...))`.
+
+Both backends normalise tensors into `Store._data` (`Dict[str, List[Tensor]]`) and `Store._cum` (`Dict[str, List[int]]`, cumulative lengths for bisect-based indexing).
 
 ## Data Keys by Training Type
 
@@ -61,4 +106,4 @@ DatasetFactory.load(train_type, load_path, window_size, stride=None, storage_typ
 
 Standard PyTorch `DataLoader` with configurable `batch_size`, `num_workers`, `pin_memory`, `prefetch_factor`. Sampler produces indices; dataloader fetches tensor batches via `__getitem__`.
 
-> Document Update Time: 2026-05-30
+> Document Update Time: 2026-06-19
diff --git a/assets/docs/inference.md b/assets/docs/inference.md
index 54435c5..764058b 100644
--- a/assets/docs/inference.md
+++ b/assets/docs/inference.md
@@ -1,5 +1,16 @@
 # Inference
 
+## Contents
+
+- [KV Cache](#kv-cache)
+- [KVCache System](#kvcache-system)
+- [Continuous Batching](#continuous-batching)
+- [Sampling](#sampling-strategy-pattern)
+- [Protocol Handlers](#protocol-handlers-strategy-pattern)
+- [Engine & GenerateResult](#engine--generateresult)
+- [HTTP API](#http-api) — endpoints, SSE, errors, stats
+- [Engine API](#engine-api)
+
 ## KV Cache
 
 At decode time, only the last query token matters. All previous K/V are cached to avoid recomputation:
@@ -133,6 +144,92 @@ Supports `stop_sequences` and streaming via `event: content_block_delta`.
 | `max_tokens` | Optional[int] | None | Max generation length |
 | `stream` | bool | False | Stream output |
 
+### SSE Streaming Format
+
+**OpenAI** (`/v1/chat/completions`, `stream=true`):
+
+```
+data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"astrai",
+       "choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}
+
+data: {"id":"chatcmpl-...","object":"chat.completion.chunk",...,
+       "choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}
+
+data: {"id":"chatcmpl-...","object":"chat.completion.chunk",...,
+       "choices":[{"index":0,"delta":{},"finish_reason":"stop"}],
+       "usage":{"prompt_tokens":5,"completion_tokens":1,"total_tokens":6}}
+
+data: [DONE]
+```
+
+**Anthropic** (`/v1/messages`, `stream=true`):
+
+```
+event: message_start
+data: {"type":"message_start","message":{"id":"msg_...","model":"astrai","role":"assistant",
+       "content":[],"stop_reason":null,...}}
+
+event: content_block_start
+data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}}
+
+event: content_block_delta
+data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}}
+
+event: content_block_stop
+data: {"type":"content_block_stop","index":0}
+
+event: message_delta
+data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{...}}
+
+event: message_stop
+data: {"type":"message_stop"}
+```
+
+### Error Responses
+
+All endpoints use standard HTTP status codes:
+
+| Status | Meaning |
+|--------|---------|
+| 200 | Success |
+| 400 | Invalid request (bad JSON, missing fields, validation error) |
+| 405 | Method not allowed |
+| 422 | Unprocessable entity (Pydantic validation) |
+| 500 | Internal server error (model crash, OOM, scheduler failure) |
+| 503 | Service unavailable (model not loaded, engine not ready) |
+
+Error response body:
+
+```json
+{
+    "error": {
+        "message": "Invalid request: max_tokens must be > 0",
+        "type": "invalid_request_error",
+        "code": 400
+    }
+}
+```
+
+### Stats Endpoint
+
+```
+GET /stats
+```
+
+Response:
+
+```json
+{
+    "active_requests": 3,
+    "waiting_requests": 2,
+    "total_requests": 128,
+    "cache_usage": 0.45,
+    "tokens_generated": 10240
+}
+```
+
+`cache_usage` is the fraction of KV cache pages currently in use (0.0–1.0).
+
 ## Engine API
 
 ```python
@@ -149,4 +246,4 @@ async for token in engine.generate_async("Hello", ...):    # -> AsyncGenerator[s
     print(token)
 ```
 
-> Document Update Time: 2026-05-30
+> Document Update Time: 2026-06-19
diff --git a/assets/docs/params.md b/assets/docs/params.md
index 65150f3..4526b19 100644
--- a/assets/docs/params.md
+++ b/assets/docs/params.md
@@ -1,4 +1,11 @@
-# Parameter Documentation
+# CLI Parameter Reference
+
+## Contents
+
+- [Training Parameters](#training-parameters)
+- [Inference Server](#inference-server-serverpy)
+- [Generate](#generate-generatepy)
+- [Preprocess](#preprocess-preprocesspy)
 
 ## Training Parameters
 
@@ -122,4 +129,64 @@ nohup python scripts/tools/train.py \
 
 ---
 
-> Document Update Time: 2026-05-24
\ No newline at end of file
+## Inference Server (`server.py`)
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `--host` | str | `0.0.0.0` | Host address |
+| `--port` | int | `8000` | Port number |
+| `--param_path` | path | `project_root/params` | Path to model parameters |
+| `--device` | str | `cuda` | Device to load model on |
+| `--dtype` | str | `bfloat16` | Model weights dtype (`bfloat16`, `float16`, `float32`) |
+| `--max_batch_size` | int | `16` | Maximum batch size for continuous batching |
+| `--reload` | flag | `False` | Enable auto-reload for development |
+
+Usage:
+```bash
+python scripts/tools/server.py --param_path ./params --device cuda --dtype bfloat16
+```
+
+See [Inference Guide](inference.md) for HTTP API documentation.
+
+## Generate (`generate.py`)
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `--param_path` | str | required | Path to the model directory |
+| `--input_json_file` | str | required | Path to the input JSONL file |
+| `--output_json_file` | str | required | Path to the output JSONL file |
+| `--question_key` | str | `question` | Key for the question in input JSON |
+| `--response_key` | str | `response` | Key for the response in output JSON |
+| `--temperature` | float | `0.60` | Sampling temperature |
+| `--top_k` | int | `30` | Top-k filtering |
+| `--top_p` | float | `0.95` | Nucleus sampling threshold |
+| `--batch_size` | int | `1` | Batch size for generation |
+| `--max_tokens` | int | `2048` | Maximum tokens to generate |
+
+Usage:
+```bash
+python scripts/tools/generate.py \
+    --param_path ./params \
+    --input_json_file input.jsonl \
+    --output_json_file output.jsonl
+```
+
+## Preprocess (`preprocess.py`)
+
+| Parameter | Type | Default | Description |
+|-----------|------|---------|-------------|
+| `input_files` | path(s) | required | Input JSONL file(s), supports glob (`data/*.jsonl`) |
+| `--output_dir`, `-o` | path | required | Output directory for processed data |
+| `--config`, `-c` | path | required | Preprocessing pipeline config (JSON) |
+| `--num_workers` | int | `4` | Number of parallel workers |
+
+Usage:
+```bash
+python scripts/tools/preprocess.py data/*.jsonl -o output/ -c sft.json
+```
+
+See [Preprocessing Guide](preprocessing.md) for config file format and examples.
+
+---
+
+> Document Update Time: 2026-06-19
\ No newline at end of file
diff --git a/assets/docs/preprocessing.md b/assets/docs/preprocessing.md
index de825b7..c049288 100644
--- a/assets/docs/preprocessing.md
+++ b/assets/docs/preprocessing.md
@@ -2,6 +2,17 @@
 
 Declarative JSON-driven data preprocessing. One `SectionedMaskBuilder` handles all formats via `input.sections` (single-output) or `input.sources` (multi-output).
 
+## Contents
+
+- [Philosophy](#philosophy)
+- [Config Structure](#config-structure)
+- [Quick Start](#quick-start) — SFT Chat, SFT Instruction, Pretrain, DPO, GRPO examples
+- [Configuration Reference](#configuration-reference) — all fields
+- [Mask Algorithm](#mask-algorithm)
+- [Output Layout](#output-layout)
+- [CLI](#cli)
+- [Python API](#python-api)
+
 ## Philosophy
 
 | Component | Responsibility |
diff --git a/assets/docs/training.md b/assets/docs/training.md
index a885361..443aaec 100644
--- a/assets/docs/training.md
+++ b/assets/docs/training.md
@@ -1,5 +1,18 @@
 # Training
 
+## Contents
+
+- [Autoregression](#autoregression)
+- [Causal Mask](#causal-mask)
+- [Rotary Position Embedding (RoPE)](#rotary-position-embedding-rope)
+- [Training Loop](#training-loop)
+- [Strategies](#strategies) — SEQ, SFT, DPO, GRPO
+- [LR Schedulers](#lr-schedulers)
+- [Gradient Checkpointing](#gradient-checkpointing)
+- [Checkpoint](#checkpoint)
+- [TrainContextBuilder](#traincontextbuilder-builder-pattern)
+- [Training CLI](#training-cli)
+
 ### Autoregression
 
 Given a token sequence, the model predicts the probability of the next token. Each generated token is appended to the input and fed back, repeating until an end-of-sequence token or max length.