347 lines
7.6 KiB
Markdown
347 lines
7.6 KiB
Markdown
# Preprocessing Pipeline
|
|
|
|
Declarative JSON-driven data preprocessing. One `SectionedMaskBuilder` handles all formats via `input.sections` (single-output) or `input.sources` (multi-output).
|
|
|
|
## Philosophy
|
|
|
|
| Component | Responsibility |
|
|
|-----------|---------------|
|
|
| `tokenizer_config.json` (`chat_template`) | Formatting -- how roles become tokens |
|
|
| `pipeline.json` (`mask`) | Masking -- which roles participate in training |
|
|
|
|
A single config file captures the entire pipeline, reusable and version-controllable.
|
|
|
|
## Config Structure
|
|
|
|
```json
|
|
{
|
|
"input": {}, // sections (single) or sources (multi)
|
|
"mask": {}, // role → "train" | "mask"
|
|
"mask_default": "mask",
|
|
"preprocessing": {},
|
|
"output": {}
|
|
}
|
|
```
|
|
|
|
### Section Fields
|
|
|
|
| Field | Type | Default | Description |
|
|
|-------|------|---------|-------------|
|
|
| `field` | str | -- | JSONL key to read |
|
|
| `action` | str | -- | `"train"` / `"mask"` / `"$role"` |
|
|
| `template` | bool | `false` | Apply `chat_template` per message |
|
|
| `add_special_tokens` | bool | `true` for first non-template section | Add special tokens during encode |
|
|
|
|
### Source Fields (multi-output mode)
|
|
|
|
| Field | Type | Default | Description |
|
|
|-------|------|---------|-------------|
|
|
| `sections` | list[dict] | -- | Same as single-output section list |
|
|
| `list_field` | bool | `false` | JSONL field holds a list; tokenise each element |
|
|
| `mask_key` | str | `"{key}_mask"` | Explicit output key for loss mask |
|
|
|
|
---
|
|
|
|
## Quick Start
|
|
|
|
### SFT Chat
|
|
|
|
Input JSONL:
|
|
|
|
```json
|
|
{"messages": [{"role": "system", "content": "You are helpful."}, {"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello!"}]}
|
|
```
|
|
|
|
Config:
|
|
|
|
```json
|
|
{
|
|
"input": {
|
|
"sections": [
|
|
{"field": "messages", "action": "$role", "template": true}
|
|
]
|
|
},
|
|
"mask": {
|
|
"system": "mask",
|
|
"user": "mask",
|
|
"assistant": "train"
|
|
},
|
|
"mask_default": "mask",
|
|
"preprocessing": {
|
|
"max_seq_len": 2048
|
|
},
|
|
"output": {
|
|
"storage_format": "bin",
|
|
"dtype": {"loss_mask": "bool"}
|
|
}
|
|
}
|
|
```
|
|
|
|
Output keys: `sequence` (int32), `loss_mask` (bool)
|
|
|
|
### SFT Instruction
|
|
|
|
Input JSONL:
|
|
|
|
```json
|
|
{"prompt": "Translate to French: Hello", "response": "Bonjour"}
|
|
```
|
|
|
|
Config:
|
|
|
|
```json
|
|
{
|
|
"input": {
|
|
"sections": [
|
|
{"field": "prompt", "action": "mask", "add_special_tokens": true},
|
|
{"field": "response", "action": "train"}
|
|
]
|
|
},
|
|
"mask_default": "mask",
|
|
"preprocessing": {
|
|
"max_seq_len": 2048
|
|
}
|
|
}
|
|
```
|
|
|
|
Output keys: `sequence`, `loss_mask`
|
|
|
|
### Pretrain
|
|
|
|
Input JSONL:
|
|
|
|
```json
|
|
{"text": "Artificial Intelligence is a field of computer science..."}
|
|
```
|
|
|
|
Config:
|
|
|
|
```json
|
|
{
|
|
"input": {
|
|
"sections": [
|
|
{"field": "text", "action": "train"}
|
|
]
|
|
},
|
|
"preprocessing": {
|
|
"max_seq_len": 8192,
|
|
"min_chars": 100
|
|
}
|
|
}
|
|
```
|
|
|
|
Output keys: `sequence` (no `loss_mask` — all tokens trained)
|
|
|
|
### DPO
|
|
|
|
Input JSONL:
|
|
|
|
```json
|
|
{"chosen": [{"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "4"}], "rejected": [{"role": "user", "content": "What is 2+2?"}, {"role": "assistant", "content": "5"}]}
|
|
```
|
|
|
|
Config:
|
|
|
|
```json
|
|
{
|
|
"input": {
|
|
"sources": {
|
|
"chosen": {
|
|
"sections": [
|
|
{"field": "chosen", "action": "$role", "template": true}
|
|
]
|
|
},
|
|
"rejected": {
|
|
"sections": [
|
|
{"field": "rejected", "action": "$role", "template": true}
|
|
]
|
|
}
|
|
}
|
|
},
|
|
"mask": {
|
|
"user": "mask",
|
|
"assistant": "train"
|
|
},
|
|
"mask_default": "mask"
|
|
}
|
|
```
|
|
|
|
Output keys: `chosen`, `chosen_mask`, `rejected`, `rejected_mask`
|
|
|
|
### GRPO
|
|
|
|
Input JSONL:
|
|
|
|
```json
|
|
{"prompt": [{"role": "user", "content": "What is 2+2?"}], "responses": ["4", "Five", "Four"], "rewards": [1.0, 0.3, 0.8]}
|
|
```
|
|
|
|
Config:
|
|
|
|
```json
|
|
{
|
|
"input": {
|
|
"sources": {
|
|
"prompts": {
|
|
"sections": [
|
|
{"field": "prompt", "action": "mask", "template": true}
|
|
]
|
|
},
|
|
"responses": {
|
|
"sections": [
|
|
{"field": "responses", "action": "train"}
|
|
],
|
|
"list_field": true,
|
|
"mask_key": "masks"
|
|
},
|
|
"rewards": {
|
|
"sections": [
|
|
{"field": "rewards", "action": "value"}
|
|
]
|
|
}
|
|
}
|
|
},
|
|
"mask": {
|
|
"user": "mask",
|
|
"assistant": "train"
|
|
},
|
|
"mask_default": "mask"
|
|
}
|
|
```
|
|
|
|
Output keys: `prompts`, `responses`, `masks`, `rewards` (float32)
|
|
|
|
- `action: "value"` — extract raw values from JSONL without tokenisation
|
|
- `list_field: true` — tokenise each list element independently, then concatenate
|
|
- `mask_key: "masks"` — rename the auto-generated mask key (default: `responses_mask`)
|
|
|
|
---
|
|
|
|
## Configuration Reference
|
|
|
|
### `input`
|
|
|
|
| Field | Type | Default | Description |
|
|
|-------|------|---------|-------------|
|
|
| `sections` | list[dict] or null | `null` | Section specs for single-output mode |
|
|
| `sources` | dict[str, dict] or null | `null` | Source specs for multi-output mode (DPO/GRPO) |
|
|
|
|
When `sources` is set, `sections` is ignored.
|
|
|
|
### `mask`
|
|
|
|
| Field | Type | Default | Description |
|
|
|-------|------|---------|-------------|
|
|
| `mask` | dict | `{}` | `{role: "train" \| "mask"}` |
|
|
| `mask_default` | str | `"mask"` | Default action for unlisted roles |
|
|
|
|
### `preprocessing`
|
|
|
|
| Field | Type | Default | Description |
|
|
|-------|------|---------|-------------|
|
|
| `max_seq_len` | int | `2048` | Truncate sequences to this length |
|
|
| `min_chars` | int | `50` | Skip text-mode items shorter than this |
|
|
| `max_chars` | int | `2000000` | Skip text-mode items longer than this |
|
|
| `max_items` | int or null | `null` | Stop after N documents |
|
|
|
|
### `output`
|
|
|
|
| Field | Type | Default | Description |
|
|
|-------|------|---------|-------------|
|
|
| `domain_key` | str or null | `null` | JSONL key for domain grouping |
|
|
| `storage_format` | str | `"bin"` | `"bin"` (mmap) or `"h5"` |
|
|
| `max_tokens_per_shard` | int | `100000000` | Flush threshold in cumulative tokens |
|
|
| `dtype` | dict[str, str] | `{}` | Per-key tensor dtype override (e.g. `{"loss_mask": "bool"}`) |
|
|
|
|
---
|
|
|
|
## Mask Algorithm
|
|
|
|
### Template mode (`template: true`)
|
|
|
|
For each message in the field's array:
|
|
|
|
1. Prepend BOS token (masked)
|
|
2. Render through `chat_template` for that single message
|
|
3. Encode rendered text
|
|
4. Apply mask rule for the message's role
|
|
|
|
### Non-template mode
|
|
|
|
Encode the field value as text. Mask value is 1 (train) or 0 (mask) per the section's `action`.
|
|
|
|
### Text config detection
|
|
|
|
When no section uses `template` and all sections have `action: "train"`, the builder skips mask generation entirely — all tokens are trained.
|
|
|
|
---
|
|
|
|
## Output Layout
|
|
|
|
### Single-Shard (`bin`)
|
|
|
|
```
|
|
output/
|
|
__default__/
|
|
meta.json
|
|
sequence.bin
|
|
loss_mask.bin
|
|
wiki/
|
|
meta.json
|
|
sequence.bin
|
|
loss_mask.bin
|
|
```
|
|
|
|
### Multi-Shard (`bin`)
|
|
|
|
When `max_tokens_per_shard` is exceeded:
|
|
|
|
```
|
|
output/
|
|
__default__/
|
|
shard_0000/
|
|
meta.json
|
|
sequence.bin
|
|
loss_mask.bin
|
|
shard_0001/
|
|
meta.json
|
|
sequence.bin
|
|
loss_mask.bin
|
|
```
|
|
|
|
`MmapStore` discovers all shards under the domain directory via `rglob("meta.json")`.
|
|
|
|
---
|
|
|
|
## CLI
|
|
|
|
```bash
|
|
# SFT
|
|
python scripts/tools/preprocess.py data/sft/*.jsonl -o output/sft/ -c configs/sft_chat.json
|
|
|
|
# DPO
|
|
python scripts/tools/preprocess.py data/dpo/*.jsonl -o output/dpo/ -c configs/dpo.json --tokenizer_path params
|
|
|
|
# GRPO
|
|
python scripts/tools/preprocess.py data/grpo/*.jsonl -o output/grpo/ -c configs/grpo.json
|
|
```
|
|
|
|
---
|
|
|
|
## Python API
|
|
|
|
```python
|
|
from astrai.preprocessing.pipeline import Pipeline
|
|
from astrai.config.preprocess_config import PipelineConfig
|
|
|
|
config = PipelineConfig.from_json("sft.json")
|
|
Pipeline(
|
|
config,
|
|
["data_part1.jsonl", "data_part2.jsonl"],
|
|
output_dir="output/",
|
|
tokenizer_path="params",
|
|
).run()
|
|
```
|
|
|
|
> Document Update Time: 2026-06-03
|