From 8ab7564d02a524881af8bd87e05ad7129914240b Mon Sep 17 00:00:00 2001 From: ViperEkura <3081035982@qq.com> Date: Fri, 19 Jun 2026 13:52:32 +0800 Subject: [PATCH] docs: restructure README, add table-of-contents navigation to all docs MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit - README: add an end-to-end Getting Started flow, merging Quick Start and Demo and removing duplication - Chinese README: sync structure with the English version; preprocessing config now uses the seq strategy - inference.md: document the SSE streaming format, error responses, and the /stats endpoint - params.md: expand into a CLI reference with parameter tables for server/generate/preprocess - dataflow.md: split into tokenization/format detection/backend subsections; add a flow diagram - architecture/training/inference/preprocessing: add table-of-contents navigation - README: remove the CI badge --- README.md | 147 +++++++++++++++++------------------ assets/docs/README-zh-CN.md | 145 +++++++++++++++++----------------- assets/docs/architecture.md | 7 ++ assets/docs/dataflow.md | 53 ++++++++++++- assets/docs/inference.md | 99 ++++++++++++++++++++++- assets/docs/params.md | 71 ++++++++++++++++- assets/docs/preprocessing.md | 11 +++ assets/docs/training.md | 13 ++++ 8 files changed, 392 insertions(+), 154 deletions(-) diff --git a/README.md b/README.md index b5c9369..26482dc 100644 --- a/README.md +++ b/README.md @@ -12,7 +12,6 @@ release stars forks - ci
@@ -29,7 +28,8 @@ ## 📖 Table of Contents - [Features](#features) -- [Quick Start](#quick-start) +- [Getting Started](#getting-started) +- [Demo](#demo) - [Documentation](#documentation) - [Contributing](#contributing) - [Community](#community) @@ -50,33 +50,43 @@ - 🤗 **HuggingFace-Style API**: AutoModel/AutoTokenizer APIs inspired by HuggingFace for easy model and tokenizer loading. - 🔌 **Dual API Compatibility**: Supports both OpenAI and Anthropic chat completion APIs out of the box. -### Quick Start +### Getting Started -#### Installation +End-to-end walkthrough in 5 steps: + +**1. Install** ```bash git clone https://github.com/ViperEkura/AstrAI.git cd AstrAI pip install -e . +# pip install -e ".[dev]" # optional: dev dependencies (pytest, ruff) ``` -For development dependencies: +**2. Download model** ```bash -pip install -e ".[dev]" +python scripts/demo/download.py # downloads 1B checkpoint to params/ ``` -#### Download Pre-trained Model +**3. Preprocess data** -Download pre-trained model weights (1B bilingual checkpoint) to `params/`: +Create `pretrain.json` (preprocessing config for `seq` strategy): + +```json +{ + "version": 1, + "input": {"sections": [{"field": "text", "action": "train"}]}, + "preprocessing": {"max_seq_len": 2048}, + "output": {"storage_format": "bin"} +} +``` ```bash -python scripts/demo/download.py +python scripts/tools/preprocess.py data/*.jsonl -o output/ -c pretrain.json ``` -Or download manually from [HuggingFace](https://huggingface.co/ViperEk/KHAOSZ) into `params/`. - -#### Train a Model +**4. Train** ```bash export CUDA_VISIBLE_DEVICES=0,1,2,3 @@ -103,15 +113,54 @@ nohup python scripts/tools/train.py \ > out.log 2> err.log & ``` -Full reference at [Parameter Guide](assets/docs/params.md). +**5. 
Serve & query** -#### Generate Text +```bash +# Terminal 1: start server +python scripts/tools/server.py --param_path ./params --device cuda + +# Terminal 2: query +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"messages":[{"role":"user","content":"Hello"}],"max_tokens":512}' +``` + +### Demo + +Check out the demos in the `scripts/demo/` folder: + +```bash +# Download model weights (required before running demos) +python scripts/demo/download.py # model → params/ + +# Interactive streaming chat (multi-turn, maintains history) +python scripts/demo/stream_chat.py +# Type your message after >>, type !exit to quit + +# Batch generation (5 hardcoded prompts, non-streaming) +python scripts/demo/generate_batch.py + +# Single-prompt autoregressive streaming +python scripts/demo/generate_ar.py +``` + +All generation demos use `temperature=0.8`, `top_p=0.95`, `top_k=50`, `max_tokens=2048` by default and require `params/` to contain model weights (run `download.py` first). + +Watch a video walkthrough on [bilibili](https://www.bilibili.com/video/BV1fuLB6yEj6). + +--- + +See [Documentation](#documentation) for full references beyond the examples above. + +#### Text Generation + +Batch generation from a JSONL file: ```bash python scripts/tools/generate.py \ - --param_path /path/to/model \ - --input_json_file /path/to/input.jsonl \ - --output_json_file /path/to/output.jsonl + --param_path ./params \ + --input_json_file input.jsonl \ + --output_json_file output.jsonl ``` #### Docker @@ -125,9 +174,6 @@ docker build -t astrai:latest . # Run with GPU support docker run --gpus all -it astrai:latest -# Run with specific GPUs -docker run --gpus '"device=0,1"' -it astrai:latest - # Run inference server docker run --gpus all -p 8000:8000 astrai:latest \ python -m scripts.tools.server --port 8000 --device cuda @@ -144,84 +190,37 @@ docker compose --profile cpu up -d > **Note**: `--gpus all` is required for CUDA support. 
Without it, `torch.cuda.is_available()` will return `False`. -#### Start HTTP Server +#### HTTP API Examples -Start the inference server with OpenAI and Anthropic-compatible HTTP API: +Additional request examples beyond the [Getting Started](#getting-started) flow: ```bash -python -m scripts.tools.server --port 8000 --device cuda -``` - -Make requests: - -```bash -# OpenAI-compatible -curl -X POST http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "messages": [{"role": "user", "content": "Hello"}], - "max_tokens": 512 - }' - # OpenAI-compatible streaming curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ - -d '{ - "messages": [{"role": "user", "content": "Tell a story"}], - "stream": true, - "max_tokens": 500 - }' + -d '{"messages":[{"role":"user","content":"Tell a story"}],"stream":true,"max_tokens":500}' # Anthropic-compatible curl -X POST http://localhost:8000/v1/messages \ -H "Content-Type: application/json" \ - -d '{ - "model": "astrai", - "system": "You are a helpful assistant.", - "messages": [{"role": "user", "content": "Hello"}], - "max_tokens": 512 - }' + -d '{"model":"astrai","system":"You are a helpful assistant.","messages":[{"role":"user","content":"Hello"}],"max_tokens":512}' # Anthropic-compatible streaming with stop sequences curl -X POST http://localhost:8000/v1/messages \ -H "Content-Type: application/json" \ - -d '{ - "model": "astrai", - "messages": [{"role": "user", "content": "Write a story"}], - "max_tokens": 500, - "stream": true, - "stop_sequences": ["The end"] - }' + -d '{"model":"astrai","messages":[{"role":"user","content":"Write a story"}],"max_tokens":500,"stream":true,"stop_sequences":["The end"]}' # Health check curl http://localhost:8000/health ``` -#### Demo - -Check out the demos in the `scripts/demo/` folder: - -```bash -# Download model weights (required before running demos) -python scripts/demo/download.py - -# Interactive streaming chat 
-python scripts/demo/stream_chat.py - -# Batch generation -python scripts/demo/generate_batch.py - -# Auto‑regressive generation -python scripts/demo/generate_ar.py -``` - -Watch a video walkthrough on [bilibili](https://www.bilibili.com/video/BV1fuLB6yEj6). +See [Inference Guide](assets/docs/inference.md) for SSE streaming format, error codes, and stats endpoint. ### Documentation | Document | Description | |----------|-------------| -| [Parameter Guide](./assets/docs/params.md) | Training & inference parameters | +| [CLI Reference](./assets/docs/params.md) | Parameters for all CLI tools (train, server, generate, preprocess) | | [Architecture](./assets/docs/architecture.md) | System architecture, class diagram & design patterns | | [Training](./assets/docs/training.md) | Training loop, strategies & formulas | | [Inference](./assets/docs/inference.md) | KVCache, continuous batching, sampling & HTTP API | diff --git a/assets/docs/README-zh-CN.md b/assets/docs/README-zh-CN.md index e0063b3..a1c1f29 100644 --- a/assets/docs/README-zh-CN.md +++ b/assets/docs/README-zh-CN.md @@ -18,7 +18,6 @@ release stars forks - ci
@@ -35,7 +34,8 @@ ## 📖 目录 - [特性](#特性) -- [快速开始](#快速开始) +- [快速上手](#快速上手) +- [演示](#演示) - [文档](#文档) - [贡献](#贡献) - [社区](#社区) @@ -56,33 +56,43 @@ - 🤗 **HuggingFace 风格 API**: 类 HuggingFace 的 AutoModel/AutoTokenizer 接口,方便加载模型和分词器。 - 🔌 **双 API 兼容**: 同时支持 OpenAI 和 Anthropic 聊天补全 API,开箱即用。 -### 快速开始 +### 快速上手 -#### 安装 +端到端演示,只需 5 步: + +**1. 安装** ```bash git clone https://github.com/ViperEkura/AstrAI.git cd AstrAI pip install -e . +# pip install -e ".[dev]" # 可选:开发依赖(pytest, ruff) ``` -安装开发依赖: +**2. 下载模型** ```bash -pip install -e ".[dev]" +python scripts/demo/download.py # 下载 1B 检查点到 params/ ``` -#### 下载预训练模型 +**3. 预处理数据** -下载预训练模型权重(1B 双语检查点)到 `params/` 目录: +创建 `pretrain.json`(`seq` 策略的预处理配置): + +```json +{ + "version": 1, + "input": {"sections": [{"field": "text", "action": "train"}]}, + "preprocessing": {"max_seq_len": 2048}, + "output": {"storage_format": "bin"} +} +``` ```bash -python scripts/demo/download.py +python scripts/tools/preprocess.py data/*.jsonl -o output/ -c pretrain.json ``` -或从 [HuggingFace](https://huggingface.co/ViperEk/KHAOSZ) 手动下载放入 `params/`。 - -#### 训练模型 +**4. 训练** ```bash export CUDA_VISIBLE_DEVICES=0,1,2,3 @@ -109,15 +119,54 @@ nohup python scripts/tools/train.py \ > out.log 2> err.log & ``` -完整参数列表见[参数说明](./params.md)。 +**5. 
启动服务并调用** + +```bash +# 终端 1:启动服务 +python scripts/tools/server.py --param_path ./params --device cuda + +# 终端 2:发起请求 +curl http://localhost:8000/v1/chat/completions \ + -H "Content-Type: application/json" \ + -d '{"messages":[{"role":"user","content":"你好"}],"max_tokens":512}' +``` + +### 演示 + +查看 `scripts/demo/` 文件夹中的演示: + +```bash +# 下载模型权重(运行演示前必需) +python scripts/demo/download.py # model → params/ + +# 交互式流式聊天(多轮对话,保持历史记录) +python scripts/demo/stream_chat.py +# 在 >> 后输入消息,输入 !exit 退出 + +# 批量生成(5 条硬编码提示词,非流式) +python scripts/demo/generate_batch.py + +# 单条提示词自回归流式生成 +python scripts/demo/generate_ar.py +``` + +所有生成演示默认使用 `temperature=0.8`、`top_p=0.95`、`top_k=50`、`max_tokens=2048`,需要 `params/` 目录包含模型权重(请先运行 `download.py`)。 + +观看 [bilibili](https://www.bilibili.com/video/BV1fuLB6yEj6) 上的视频演示。 + +--- + +更多选项请参考[文档](#文档)。 #### 文本生成 +从 JSONL 文件批量生成: + ```bash python scripts/tools/generate.py \ - --param_path /path/to/model \ - --input_json_file /path/to/input.jsonl \ - --output_json_file /path/to/output.jsonl + --param_path ./params \ + --input_json_file input.jsonl \ + --output_json_file output.jsonl ``` #### Docker @@ -131,9 +180,6 @@ docker build -t astrai:latest . 
# 启用 GPU 运行 docker run --gpus all -it astrai:latest -# 指定特定 GPU -docker run --gpus '"device=0,1"' -it astrai:latest - # 运行推理服务 docker run --gpus all -p 8000:8000 astrai:latest \ python -m scripts.tools.server --port 8000 --device cuda @@ -150,84 +196,37 @@ docker compose --profile cpu up -d > **注意**: 必须使用 `--gpus all` 才能启用 CUDA 支持,否则 `torch.cuda.is_available()` 将返回 `False`。 -#### 启动 HTTP 服务 +#### HTTP API 示例 -启动推理服务器,支持 OpenAI 和 Anthropic 兼容的 HTTP API: +除[快速上手](#快速上手)流程外,更多请求示例: ```bash -python -m scripts.tools.server --port 8000 --device cuda -``` - -发起请求: - -```bash -# OpenAI 兼容 -curl -X POST http://localhost:8000/v1/chat/completions \ - -H "Content-Type: application/json" \ - -d '{ - "messages": [{"role": "user", "content": "你好"}], - "max_tokens": 512 - }' - # OpenAI 兼容流式 curl -X POST http://localhost:8000/v1/chat/completions \ -H "Content-Type: application/json" \ - -d '{ - "messages": [{"role": "user", "content": "讲个故事"}], - "stream": true, - "max_tokens": 500 - }' + -d '{"messages":[{"role":"user","content":"讲个故事"}],"stream":true,"max_tokens":500}' # Anthropic 兼容 curl -X POST http://localhost:8000/v1/messages \ -H "Content-Type: application/json" \ - -d '{ - "model": "astrai", - "system": "你是一个乐于助人的助手。", - "messages": [{"role": "user", "content": "你好"}], - "max_tokens": 512 - }' + -d '{"model":"astrai","system":"你是一个乐于助人的助手。","messages":[{"role":"user","content":"你好"}],"max_tokens":512}' # Anthropic 兼容流式并设置停止序列 curl -X POST http://localhost:8000/v1/messages \ -H "Content-Type: application/json" \ - -d '{ - "model": "astrai", - "messages": [{"role": "user", "content": "写个故事"}], - "max_tokens": 500, - "stream": true, - "stop_sequences": ["结束"] - }' + -d '{"model":"astrai","messages":[{"role":"user","content":"写个故事"}],"max_tokens":500,"stream":true,"stop_sequences":["结束"]}' # 健康检查 curl http://localhost:8000/health ``` -#### 演示 - -查看 `scripts/demo/` 文件夹中的演示: - -```bash -# 下载模型权重(运行演示前必需) -python scripts/demo/download.py - -# 交互式流式聊天 -python 
scripts/demo/stream_chat.py - -# 批量生成 -python scripts/demo/generate_batch.py - -# 自回归生成 -python scripts/demo/generate_ar.py -``` - -观看 [bilibili](https://www.bilibili.com/video/BV1fuLB6yEj6) 上的视频演示。 +SSE 流式格式、错误码和统计端点详见[推理文档](./inference.md)。 ### 文档 | 文档 | 说明 | |------|------| -| [参数说明](./params.md) | 训练与推理参数配置 | +| [CLI 参考](./params.md) | 所有 CLI 工具参数(训练、服务、生成、预处理) | | [架构文档](./architecture.md) | 系统架构、类图与设计模式 | | [训练文档](./training.md) | 训练循环、策略与公式 | | [推理文档](./inference.md) | KVCache、连续批处理、采样与 HTTP API | diff --git a/assets/docs/architecture.md b/assets/docs/architecture.md index a57a338..2590e63 100644 --- a/assets/docs/architecture.md +++ b/assets/docs/architecture.md @@ -1,5 +1,12 @@ # AstrAI Architecture +## Contents + +- [Class Diagram](#class-diagram) — Full Mermaid class diagram across 10+ namespaces +- [Module Overview](#module-overview) — Component inventory per module +- [Design Patterns](#design-patterns) — 13 documented patterns with classes +- [Core Relationships](#core-relationships) — 11 key inter-component relationships + ## Class Diagram ```mermaid diff --git a/assets/docs/dataflow.md b/assets/docs/dataflow.md index df5f599..08cef80 100644 --- a/assets/docs/dataflow.md +++ b/assets/docs/dataflow.md @@ -1,17 +1,58 @@ # Data Flow -This document describes the data pipeline: from raw text to model input tensors. +This document describes the data pipeline: from raw text to model input tensors. For creating preprocessing configs, see [Preprocessing Guide](preprocessing.md). 
+ +## Contents + +- [Overview](#overview) +- [Data Preparation](#data-preparation) — tokenization, format detection, backends +- [Data Keys by Training Type](#data-keys-by-training-type) +- [Dataset Architecture](#dataset-architecture) +- [Sampler](#sampler) +- [DataLoader](#dataloader) ## Overview ``` -Raw Text → AutoTokenizer → Token IDs → .h5/.bin → Store.load() → Store.fetch() → Dataset → Sampler → DataLoader → Training/Inference +JSONL Lines → Pipeline (mask builder) → Tokenized Tensors + ↓ + .h5 or .bin storage + ↓ + Store.load() + ↓ + Store.fetch(begin, end, keys) + ↓ + BaseDataset.__getitem__(idx) + ↓ + Sampler → DataLoader → Training / Inference ``` ## Data Preparation Raw text is tokenized via `AutoTokenizer.encode()` and saved as HDF5 (`.h5`) or binary (`.bin` + `meta.json`) files with keyed tensor groups. +### Tokenization + +The `Pipeline` reads JSONL lines, applies the mask builder (see [Preprocessing](preprocessing.md)), and produces flat token sequences: + +```python +# Per JSONL line: messages → chat template → token IDs + loss mask +tokens = tokenizer.encode(rendered_text) # List[int] +loss_mask = [0, 0, 0, 1, 1, 1, 1, 1, 1] # 0=masked, 1=train +# Stored as flat tensors, packed with other lines by packing strategy +``` + +The output `meta.json` records the storage format, key names, dtype, total token count, and tensor shapes for each shard. + +### Format Detection + +`detect_format(load_path)` inspects the directory: + +- If `*.h5` files exist → `"h5"` (HDF5 backend) +- If `*.bin` + `meta.json` files exist → `"bin"` (memory-mapped backend) + +### Store Backends + Storage format is auto-detected by `detect_format()`; backends are dispatched via registry: ``` @@ -19,7 +60,11 @@ StoreFactory.create("h5") → H5Store StoreFactory.create("bin") → MmapStore ``` -H5 backend supports shared memory via `.share_memory_()`. Bin (mmap) uses OS page-cache sharing natively. 
+**H5Store**: Reads HDF5 files, supports `share_memory_()` for multi-process DataLoader workers (copies tensors to shared memory). + +**MmapStore**: Memory-maps `.bin` files. OS page cache sharing is native — no explicit `share_memory_()` needed. Uses `torch.from_numpy(np.memmap(...))`. + +Both backends normalise tensors into `Store._data[Dict[str, List[Tensor]]]` + `Store._cum[Dict[str, List[int]]]` (cumulative lengths for bisect-based indexing). ## Data Keys by Training Type @@ -61,4 +106,4 @@ DatasetFactory.load(train_type, load_path, window_size, stride=None, storage_typ Standard PyTorch `DataLoader` with configurable `batch_size`, `num_workers`, `pin_memory`, `prefetch_factor`. Sampler produces indices; dataloader fetches tensor batches via `__getitem__`. -> Document Update Time: 2026-05-30 +> Document Update Time: 2026-06-19 diff --git a/assets/docs/inference.md b/assets/docs/inference.md index 54435c5..764058b 100644 --- a/assets/docs/inference.md +++ b/assets/docs/inference.md @@ -1,5 +1,16 @@ # Inference +## Contents + +- [KV Cache](#kv-cache) +- [KVCache System](#kvcache-system) +- [Continuous Batching](#continuous-batching) +- [Sampling](#sampling-strategy-pattern) +- [Protocol Handlers](#protocol-handlers-strategy-pattern) +- [Engine & GenerateResult](#engine--generateresult) +- [HTTP API](#http-api) — endpoints, SSE, errors, stats +- [Engine API](#engine-api) + ## KV Cache At decode time, only the last query token matters. All previous K/V are cached to avoid recomputation: @@ -133,6 +144,92 @@ Supports `stop_sequences` and streaming via `event: content_block_delta`. 
| `max_tokens` | Optional[int] | None | Max generation length | | `stream` | bool | False | Stream output | +### SSE Streaming Format + +**OpenAI** (`/v1/chat/completions`, `stream=true`): + +``` +data: {"id":"chatcmpl-...","object":"chat.completion.chunk","created":...,"model":"astrai", + "choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]} + +data: {"id":"chatcmpl-...","object":"chat.completion.chunk",..., + "choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]} + +data: {"id":"chatcmpl-...","object":"chat.completion.chunk",..., + "choices":[{"index":0,"delta":{},"finish_reason":"stop"}], + "usage":{"prompt_tokens":5,"completion_tokens":1,"total_tokens":6}} + +data: [DONE] +``` + +**Anthropic** (`/v1/messages`, `stream=true`): + +``` +event: message_start +data: {"type":"message_start","message":{"id":"msg_...","model":"astrai","role":"assistant", + "content":[],"stop_reason":null,...}} + +event: content_block_start +data: {"type":"content_block_start","index":0,"content_block":{"type":"text","text":""}} + +event: content_block_delta +data: {"type":"content_block_delta","index":0,"delta":{"type":"text_delta","text":"Hello"}} + +event: content_block_stop +data: {"type":"content_block_stop","index":0} + +event: message_delta +data: {"type":"message_delta","delta":{"stop_reason":"end_turn"},"usage":{...}} + +event: message_stop +data: {"type":"message_stop"} +``` + +### Error Responses + +All endpoints use standard HTTP status codes: + +| Status | Meaning | +|--------|---------| +| 200 | Success | +| 400 | Invalid request (bad JSON, missing fields, validation error) | +| 405 | Method not allowed | +| 422 | Unprocessable entity (Pydantic validation) | +| 500 | Internal server error (model crash, OOM, scheduler failure) | +| 503 | Service unavailable (model not loaded, engine not ready) | + +Error response body: + +```json +{ + "error": { + "message": "Invalid request: max_tokens must be > 0", + "type": "invalid_request_error", 
+ "code": 400 + } +} +``` + +### Stats Endpoint + +``` +GET /stats +``` + +Response: + +```json +{ + "active_requests": 3, + "waiting_requests": 2, + "total_requests": 128, + "cache_usage": 0.45, + "tokens_generated": 10240 +} +``` + +`cache_usage` is the fraction of KV cache pages currently in use (0.0–1.0). + ## Engine API ```python @@ -149,4 +246,4 @@ async for token in engine.generate_async("Hello", ...): # -> AsyncGenerator[s print(token) ``` -> Document Update Time: 2026-05-30 +> Document Update Time: 2026-06-19 diff --git a/assets/docs/params.md b/assets/docs/params.md index 65150f3..4526b19 100644 --- a/assets/docs/params.md +++ b/assets/docs/params.md @@ -1,4 +1,11 @@ -# Parameter Documentation +# CLI Parameter Reference + +## Contents + +- [Training Parameters](#training-parameters) +- [Inference Server](#inference-server-serverpy) +- [Generate](#generate-generatepy) +- [Preprocess](#preprocess-preprocesspy) ## Training Parameters @@ -122,4 +129,64 @@ nohup python scripts/tools/train.py \ --- -> Document Update Time: 2026-05-24 \ No newline at end of file +## Inference Server (`server.py`) + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `--host` | str | `0.0.0.0` | Host address | +| `--port` | int | `8000` | Port number | +| `--param_path` | path | `project_root/params` | Path to model parameters | +| `--device` | str | `cuda` | Device to load model on | +| `--dtype` | str | `bfloat16` | Model weights dtype (`bfloat16`, `float16`, `float32`) | +| `--max_batch_size` | int | `16` | Maximum batch size for continuous batching | +| `--reload` | flag | `False` | Enable auto-reload for development | + +Usage: +```bash +python scripts/tools/server.py --param_path ./params --device cuda --dtype bfloat16 +``` + +See [Inference Guide](inference.md) for HTTP API documentation. 
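For readers who want to consume the server's OpenAI-style streaming output from Python rather than `curl`, here is a minimal sketch of a parser for the `data:` lines described in the Inference Guide. It makes two assumptions: each event's JSON arrives on a single `data:` line, and the stream ends with `data: [DONE]`. The function name `collect_openai_stream` is hypothetical and is not part of AstrAI.

```python
import json
from typing import Iterable, List


def collect_openai_stream(lines: Iterable[str]) -> str:
    """Concatenate the `delta.content` pieces from OpenAI-style SSE lines.

    Hypothetical helper for illustration; assumes one JSON object per
    `data:` line and a terminating `data: [DONE]` sentinel.
    """
    parts: List[str] = []
    for raw in lines:
        line = raw.strip()
        if not line.startswith("data:"):
            continue  # skip blank separators and any non-data SSE fields
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            # The first chunk carries only the role, the last an empty delta.
            text = choice.get("delta", {}).get("content")
            if text:
                parts.append(text)
    return "".join(parts)


if __name__ == "__main__":
    sample = [
        'data: {"choices":[{"index":0,"delta":{"role":"assistant"},"finish_reason":null}]}',
        "",
        'data: {"choices":[{"index":0,"delta":{"content":"Hello"},"finish_reason":null}]}',
        'data: {"choices":[{"index":0,"delta":{},"finish_reason":"stop"}]}',
        "data: [DONE]",
    ]
    print(collect_openai_stream(sample))
```

The parser takes any iterable of strings, so it can be fed lines from whichever HTTP client is in use. Transport and error handling are deliberately left out of this sketch.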
+ +## Generate (`generate.py`) + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `--param_path` | str | required | Path to the model directory | +| `--input_json_file` | str | required | Path to the input JSONL file | +| `--output_json_file` | str | required | Path to the output JSONL file | +| `--question_key` | str | `question` | Key for the question in input JSON | +| `--response_key` | str | `response` | Key for the response in output JSON | +| `--temperature` | float | `0.60` | Sampling temperature | +| `--top_k` | int | `30` | Top-k filtering | +| `--top_p` | float | `0.95` | Nucleus sampling threshold | +| `--batch_size` | int | `1` | Batch size for generation | +| `--max_tokens` | int | `2048` | Maximum tokens to generate | + +Usage: +```bash +python scripts/tools/generate.py \ + --param_path ./params \ + --input_json_file input.jsonl \ + --output_json_file output.jsonl +``` + +## Preprocess (`preprocess.py`) + +| Parameter | Type | Default | Description | +|-----------|------|---------|-------------| +| `input_files` | path(s) | required | Input JSONL file(s), supports glob (`data/*.jsonl`) | +| `--output_dir`, `-o` | path | required | Output directory for processed data | +| `--config`, `-c` | path | required | Preprocessing pipeline config (JSON) | +| `--num_workers` | int | `4` | Number of parallel workers | + +Usage: +```bash +python scripts/tools/preprocess.py data/*.jsonl -o output/ -c sft.json +``` + +See [Preprocessing Guide](preprocessing.md) for config file format and examples. + +--- + +> Document Update Time: 2026-06-19 \ No newline at end of file diff --git a/assets/docs/preprocessing.md b/assets/docs/preprocessing.md index de825b7..c049288 100644 --- a/assets/docs/preprocessing.md +++ b/assets/docs/preprocessing.md @@ -2,6 +2,17 @@ Declarative JSON-driven data preprocessing. One `SectionedMaskBuilder` handles all formats via `input.sections` (single-output) or `input.sources` (multi-output). 
+## Contents + +- [Philosophy](#philosophy) +- [Config Structure](#config-structure) +- [Quick Start](#quick-start) — SFT Chat, SFT Instruction, Pretrain, DPO, GRPO examples +- [Configuration Reference](#configuration-reference) — all fields +- [Mask Algorithm](#mask-algorithm) +- [Output Layout](#output-layout) +- [CLI](#cli) +- [Python API](#python-api) + ## Philosophy | Component | Responsibility | diff --git a/assets/docs/training.md b/assets/docs/training.md index a885361..443aaec 100644 --- a/assets/docs/training.md +++ b/assets/docs/training.md @@ -1,5 +1,18 @@ # Training +## Contents + +- [Autoregression](#autoregression) +- [Causal Mask](#causal-mask) +- [Rotary Position Embedding (RoPE)](#rotary-position-embedding-rope) +- [Training Loop](#training-loop) +- [Strategies](#strategies) — SEQ, SFT, DPO, GRPO +- [LR Schedulers](#lr-schedulers) +- [Gradient Checkpointing](#gradient-checkpointing) +- [Checkpoint](#checkpoint) +- [TrainContextBuilder](#traincontextbuilder-builder-pattern) +- [Training CLI](#training-cli) + ### Autoregression Given a token sequence, the model predicts the probability of the next token. Each generated token is appended to the input and fed back, repeating until an end-of-sequence token or max length.
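The loop described above can be sketched in a few lines. This is an illustrative toy, not AstrAI's engine. The `next_token` callable is a hypothetical stand-in for "run the model and pick a token", and the example picks deterministically (greedy) instead of sampling from the predicted distribution.

```python
from typing import Callable, List, Sequence


def generate_greedy(
    next_token: Callable[[Sequence[int]], int],
    prompt: Sequence[int],
    eos_id: int,
    max_len: int,
) -> List[int]:
    """Append predicted tokens until EOS is produced or max_len is reached."""
    tokens = list(prompt)
    while len(tokens) < max_len:
        tok = next_token(tokens)  # the model sees the whole sequence so far
        tokens.append(tok)        # the prediction is fed back as input
        if tok == eos_id:
            break
    return tokens


def toy_next_token(tokens: Sequence[int]) -> int:
    """Toy 'model' (assumption for the demo): predicts last token + 1."""
    return tokens[-1] + 1


if __name__ == "__main__":
    print(generate_greedy(toy_next_token, [1, 2], eos_id=5, max_len=10))
```

A real decoder would replace `toy_next_token` with a forward pass followed by sampling (temperature, top-k, top-p) and would reuse cached keys and values instead of re-reading the whole sequence at every step.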