AstrAI/assets/docs/preprocessing.md

228 lines
6.6 KiB
Markdown

# Preprocessing Pipeline
Declarative JSON-driven data preprocessing. No code needed -- describe your input format and mask rules in a config file, the engine does the rest.
## Philosophy
| Component | Responsibility |
|-----------|---------------|
| `tokenizer_config.json` (`chat_template`) | Formatting -- how roles become tokens |
| `pipeline.json` (`mask`) | Masking -- which roles participate in training |
The two are fully decoupled. A single config file captures the entire pipeline, reusable and version-controllable. Extension is via factory registration (`@MaskBuilderFactory.register`) -- no need to touch existing code.
## Quick Start
### SFT Chat
```json
{
"version": 1,
"input": {
"type": "chat",
"messages_key": "messages"
},
"mask": {
"system": "mask",
"user": "mask",
"assistant": "train"
},
"mask_default": "mask",
"preprocessing": {
"max_seq_len": 2048,
"deduplicate": true
},
"output": {
"domain_key": "source",
"storage_format": "bin",
"max_tokens_per_shard": 100000000
}
}
```
Three lines of mask rules cover the most common SFT case: train on assistant turns, mask everything else.
### Instruction Tuning
```json
{
"version": 1,
"input": {
"type": "instruction",
"prompt_key": "instruction",
"response_key": "output"
},
"mask": {
"prompt": "mask",
"response": "train"
},
"mask_default": "mask",
"preprocessing": {
"max_seq_len": 2048
},
"output": {
"storage_format": "bin"
}
}
```
Mask splits at the prompt/response field boundary.
### Pretraining
```json
{
"version": 1,
"input": {
"type": "text",
"text_key": "content"
},
"mask": {},
"preprocessing": {
"max_seq_len": 2048,
"min_chars": 50
},
"output": {
"storage_format": "bin"
}
}
```
No mask -- train on all tokens.
### Run
```bash
python scripts/tools/preprocess.py data/*.jsonl -o output/ -c sft.json
```
## Configuration Reference
### `input`
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `type` | string | yes | `"chat"` | Format: `"chat"`, `"instruction"`, or `"text"` |
| `messages_key` | string | no | `"messages"` | JSON key for messages array (chat) |
| `prompt_key` | string | no | `"prompt"` | JSON key for prompt field (instruction) |
| `response_key` | string | no | `"response"` | JSON key for response field (instruction) |
| `text_key` | string | no | `"text"` | JSON key for text field |
### `mask`
A map of `{role_or_field: "mask" | "train"}`. The engine uses this to build `loss_mask`:
- `"mask"` -- tokens in this span are ignored during training (`loss_mask=0`)
- `"train"` -- tokens in this span contribute to the loss (`loss_mask=1`)
For chat mode, keys are role names (`system`, `user`, `assistant`, ...).
For instruction mode, keys are `"prompt"` and `"response"`.
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `mask` | dict | `{}` | Role/field to action mapping |
| `mask_default` | string | `"mask"` | Default action for unlisted roles |
### `preprocessing`
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `max_seq_len` | int | `2048` | Maximum token length; truncated if exceeded |
| `min_chars` | int | `50` | Minimum character length; dropped if shorter (text mode only) |
| `max_chars` | int | `2000000` | Maximum character length; dropped if longer (text mode only) |
| `deduplicate` | bool | `true` | Remove exact duplicates via MD5 of first 200 chars |
| `max_items` | int or null | `null` | Maximum items to process; `null` = unlimited |
### `output`
| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `domain_key` | string or null | `null` | JSON key for domain grouping; `null` = all output to `__default__` |
| `storage_format` | string | `"bin"` | `"bin"` (mmap, zero-copy) or `"h5"` (HDF5) |
| `max_tokens_per_shard` | int | `100000000` | Max tokens per output shard |
## Mask Algorithm
### Chat Mode (role-span tracking)
For each message in the `messages` array:
1. Render through the chat template for that single message
2. Encode the rendered text, record token span `(start, end, role)`
3. Concatenate all spans -- special tokens from the chat template naturally prevent BPE merging across message boundaries
4. Fill `loss_mask` from the mask rules
**Multi-turn example**:
```
Data:
[system: "You are helpful."]
[user: "What is 2+2?"]
[assistant: "4"]
[user: "What is 3+3?"]
[assistant: "6"]
Config:
"mask": {"system": "mask", "user": "mask", "assistant": "train"}
Result:
tokens: <bos> [system span] [user span] [assistant:4 span] [user span] [assistant:6 span]
mask: 0 0 0 1 0 1
```
Both assistant turns are trained. All system and user tokens are masked.
### Instruction Mode (field boundary)
Encode the prompt and response fields independently, then split the mask at the field boundary.
- `"prompt": "mask", "response": "train"` -- mask the left half, train the right half
- `"prompt": "train", "response": "mask"` -- the reverse
### Text Mode (no mask)
Pure tokenization. No `loss_mask` is produced. Used for pretraining.
## Output Layout
```
output_dir/
__default__/ # when domain_key is null
meta.json # {"sequence": {"shape": [N], "dtype": "int64"}, ...}
sequence.bin # int64 raw bytes, mmap-able for zero-copy reads
loss_mask.bin # int64 raw bytes
wiki/ # when domain_key="source" and item["source"]="wiki"
meta.json
sequence.bin
loss_mask.bin
```
## Extension
Register a custom builder for new formats:
```python
from astrai.preprocessing.builder import BaseMaskBuilder, MaskBuilderFactory
@MaskBuilderFactory.register("my_format")
class MyFormatBuilder(BaseMaskBuilder):
def build(self, item: dict, config, tokenizer) -> dict | None:
# Return {"ids": [...], "loss_mask": [...], "domain": "..."}
# Return None to skip this item
...
```
Then set `"input": {"type": "my_format"}` in your config.
## Compared to Old Pipeline
| Old (`astrai.preprocess.Pipeline`) | New (`astrai.preprocessing.pipeline.Pipeline`) |
|---|---|
| Configured via constructor arguments | Configured via JSON file |
| Hardcoded `_transform_chat` / `_transform_text` | Factory-registered `Builder` with declarative mask rules |
| Auto-detects format via magic key lists | Explicit `input.type` declaration |
| Double-encodes (full + prompt), uses length diff for mask | Single-encode with role-span tracking |
| Only trains the last assistant turn | Configurable: multi-turn, single-turn, or no mask |
> Document Update Time: 2026-05-30