229 lines
6.7 KiB
Markdown
229 lines
6.7 KiB
Markdown
# Preprocessing Pipeline
|
|
|
|
Declarative JSON-driven data preprocessing. No code needed -- describe your input format and mask rules in a config file, the engine does the rest.
|
|
|
|
## Philosophy
|
|
|
|
| Component | Responsibility |
|
|
|-----------|---------------|
|
|
| `tokenizer_config.json` (`chat_template`) | Formatting -- how roles become tokens |
|
|
| `pipeline.json` (`mask`) | Masking -- which roles participate in training |
|
|
|
|
The two are fully decoupled. A single config file captures the entire pipeline, reusable and version-controllable. Extension is via factory registration (`@MaskBuilderFactory.register`) -- no need to touch existing code.
|
|
|
|
## Quick Start
|
|
|
|
### SFT Chat
|
|
|
|
```json
|
|
{
|
|
"version": 1,
|
|
"input": {
|
|
"type": "chat",
|
|
"messages_key": "messages"
|
|
},
|
|
"mask": {
|
|
"system": "mask",
|
|
"user": "mask",
|
|
"assistant": "train"
|
|
},
|
|
"mask_default": "mask",
|
|
"preprocessing": {
|
|
"max_seq_len": 2048,
|
|
"deduplicate": true
|
|
},
|
|
"output": {
|
|
"domain_key": "source",
|
|
"storage_format": "bin",
|
|
"max_tokens_per_shard": 100000000
|
|
}
|
|
}
|
|
```
|
|
|
|
Three lines of mask rules cover the most common SFT case: train on assistant turns, mask everything else.
|
|
|
|
### Instruction Tuning
|
|
|
|
```json
|
|
{
|
|
"version": 1,
|
|
"input": {
|
|
"type": "instruction",
|
|
"prompt_key": "instruction",
|
|
"response_key": "output"
|
|
},
|
|
"mask": {
|
|
"prompt": "mask",
|
|
"response": "train"
|
|
},
|
|
"mask_default": "mask",
|
|
"preprocessing": {
|
|
"max_seq_len": 2048
|
|
},
|
|
"output": {
|
|
"storage_format": "bin"
|
|
}
|
|
}
|
|
```
|
|
|
|
Mask splits at the prompt/response field boundary.
|
|
|
|
### Pretraining
|
|
|
|
```json
|
|
{
|
|
"version": 1,
|
|
"input": {
|
|
"type": "text",
|
|
"text_key": "content"
|
|
},
|
|
"mask": {},
|
|
"preprocessing": {
|
|
"max_seq_len": 2048,
|
|
"min_chars": 50
|
|
},
|
|
"output": {
|
|
"storage_format": "bin"
|
|
}
|
|
}
|
|
```
|
|
|
|
No mask -- train on all tokens.
|
|
|
|
### Run
|
|
|
|
```bash
|
|
python scripts/tools/preprocess.py data/*.jsonl -o output/ -c sft.json
|
|
```
|
|
|
|
## Configuration Reference
|
|
|
|
### `input`
|
|
|
|
| Field | Type | Required | Default | Description |
|
|
|-------|------|----------|---------|-------------|
|
|
| `type` | string | yes | `"chat"` | Format: `"chat"`, `"instruction"`, or `"text"` |
|
|
| `messages_key` | string | no | `"messages"` | JSON key for messages array (chat) |
|
|
| `prompt_key` | string | no | `"prompt"` | JSON key for prompt field (instruction) |
|
|
| `response_key` | string | no | `"response"` | JSON key for response field (instruction) |
|
|
| `text_key` | string | no | `"text"` | JSON key for text field |
|
|
|
|
### `mask`
|
|
|
|
A map of `{role_or_field: "mask" | "train"}`. The engine uses this to build `loss_mask`:
|
|
|
|
- `"mask"` -- tokens in this span are ignored during training (`loss_mask=0`)
|
|
- `"train"` -- tokens in this span contribute to the loss (`loss_mask=1`)
|
|
|
|
For chat mode, keys are role names (`system`, `user`, `assistant`, ...).
|
|
For instruction mode, keys are `"prompt"` and `"response"`.
|
|
|
|
| Field | Type | Default | Description |
|
|
|-------|------|---------|-------------|
|
|
| `mask` | dict | `{}` | Role/field to action mapping |
|
|
| `mask_default` | string | `"mask"` | Default action for unlisted roles |
|
|
|
|
### `preprocessing`
|
|
|
|
| Field | Type | Default | Description |
|
|
|-------|------|---------|-------------|
|
|
| `max_seq_len` | int | `2048` | Maximum token length; truncated if exceeded |
|
|
| `min_chars` | int | `50` | Minimum character length; dropped if shorter (text mode only) |
|
|
| `max_chars` | int | `2000000` | Maximum character length; dropped if longer (text mode only) |
|
|
| `deduplicate` | bool | `true` | Remove exact duplicates via MD5 of first 200 chars |
|
|
| `max_items` | int or null | `null` | Maximum items to process; `null` = unlimited |
|
|
|
|
### `output`
|
|
|
|
| Field | Type | Default | Description |
|
|
|-------|------|---------|-------------|
|
|
| `domain_key` | string or null | `null` | JSON key for domain grouping; `null` = all output to `__default__` |
|
|
| `storage_format` | string | `"bin"` | `"bin"` (mmap, zero-copy) or `"h5"` (HDF5) |
|
|
| `max_tokens_per_shard` | int | `100000000` | Max tokens per output shard |
|
|
|
|
## Mask Algorithm
|
|
|
|
### Chat Mode (role-span tracking)
|
|
|
|
For each message in the `messages` array:
|
|
|
|
1. Prepend BOS token (position 0, always masked)
|
|
2. Render through the chat template for that single message
|
|
3. Encode the rendered text, record token span `(start, end, role)`
|
|
4. Concatenate all spans — special tokens from the chat template naturally prevent BPE merging across message boundaries
|
|
5. Fill `loss_mask` from the mask rules
|
|
|
|
**Multi-turn example**:
|
|
|
|
```
|
|
Data:
|
|
[system: "You are helpful."]
|
|
[user: "What is 2+2?"]
|
|
[assistant: "4"]
|
|
[user: "What is 3+3?"]
|
|
[assistant: "6"]
|
|
|
|
Config:
|
|
"mask": {"system": "mask", "user": "mask", "assistant": "train"}
|
|
|
|
Result:
|
|
tokens: <bos> [system span] [user span] [assistant:4 span] [user span] [assistant:6 span]
|
|
mask: 0 0 0 1 0 1
|
|
```
|
|
|
|
Both assistant turns are trained. All system and user tokens are masked.
|
|
|
|
### Instruction Mode (field boundary)
|
|
|
|
Encode the prompt and response fields independently, then split the mask at the field boundary.
|
|
|
|
- `"prompt": "mask", "response": "train"` -- mask the left half, train the right half
|
|
- `"prompt": "train", "response": "mask"` -- the reverse
|
|
|
|
### Text Mode (no mask)
|
|
|
|
Pure tokenization. No `loss_mask` is produced. Used for pretraining.
|
|
|
|
## Output Layout
|
|
|
|
```
|
|
output_dir/
|
|
__default__/ # when domain_key is null
|
|
meta.json # {"sequence": {"shape": [N], "dtype": "int64"}, ...}
|
|
sequence.bin # int64 raw bytes, mmap-able for zero-copy reads
|
|
loss_mask.bin # int64 raw bytes
|
|
wiki/ # when domain_key="source" and item["source"]="wiki"
|
|
meta.json
|
|
sequence.bin
|
|
loss_mask.bin
|
|
```
|
|
|
|
## Extension
|
|
|
|
Register a custom builder for new formats:
|
|
|
|
```python
|
|
from astrai.preprocessing.builder import BaseMaskBuilder, MaskBuilderFactory
|
|
|
|
@MaskBuilderFactory.register("my_format")
|
|
class MyFormatBuilder(BaseMaskBuilder):
|
|
def build(self, item: dict, config, tokenizer) -> dict | None:
|
|
# Return {"ids": [...], "loss_mask": [...], "domain": "..."}
|
|
# Return None to skip this item
|
|
...
|
|
```
|
|
|
|
Then set `"input": {"type": "my_format"}` in your config.
|
|
|
|
## Compared to Old Pipeline
|
|
|
|
| Old (`astrai.preprocess.Pipeline`) | New (`astrai.preprocessing.pipeline.Pipeline`) |
|
|
|---|---|
|
|
| Configured via constructor arguments | Configured via JSON file |
|
|
| Hardcoded `_transform_chat` / `_transform_text` | Factory-registered `Builder` with declarative mask rules |
|
|
| Auto-detects format via magic key lists | Explicit `input.type` declaration |
|
|
| Double-encodes (full + prompt), uses length diff for mask | Single-encode with role-span tracking |
|
|
| Only trains the last assistant turn | Configurable: multi-turn, single-turn, or no mask |
|
|
|
|
> Document Update Time: 2026-05-30
|