7.8 KiB
Preprocessing Pipeline
Declarative JSON-driven data preprocessing. No code needed -- describe your input format and mask rules in a config file, the engine does the rest.
Philosophy
| Component | Responsibility |
|---|---|
tokenizer_config.json (chat_template) |
Formatting -- how roles become tokens |
pipeline.json (mask) |
Masking -- which roles participate in training |
The two are fully decoupled. A single config file captures the entire pipeline, reusable and version-controllable. Extension is via factory registration (@MaskBuilderFactory.register) -- no need to touch existing code.
Quick Start
SFT Chat
{
"version": 1,
"input": {
"type": "chat",
"messages_key": "messages"
},
"mask": {
"system": "mask",
"user": "mask",
"assistant": "train"
},
"mask_default": "mask",
"preprocessing": {
"max_seq_len": 2048,
"deduplicate": true
},
"output": {
"domain_key": "source",
"storage_format": "bin",
"max_tokens_per_shard": 100000000
}
}
Three lines of mask rules cover the most common SFT case: train on assistant turns, mask everything else.
Instruction Tuning
{
"version": 1,
"input": {
"type": "instruction",
"prompt_key": "instruction",
"response_key": "output"
},
"mask": {
"prompt": "mask",
"response": "train"
},
"mask_default": "mask",
"preprocessing": {
"max_seq_len": 2048
},
"output": {
"storage_format": "bin"
}
}
Mask splits at the prompt/response field boundary.
Pretraining
{
"version": 1,
"input": {
"type": "text",
"text_key": "content"
},
"mask": {},
"preprocessing": {
"max_seq_len": 2048,
"min_chars": 50
},
"output": {
"storage_format": "bin"
}
}
No mask -- train on all tokens.
Run
python scripts/tools/preprocess.py data/*.jsonl -o output/ -c sft.json
Configuration Reference
input
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
type |
string | yes | "chat" |
Format: "chat", "instruction", or "text" |
messages_key |
string | no | "messages" |
JSON key for messages array (chat) |
prompt_key |
string | no | "prompt" |
JSON key for prompt field (instruction) |
response_key |
string | no | "response" |
JSON key for response field (instruction) |
text_key |
string | no | "text" |
JSON key for text field |
mask
A map of {role_or_field: "mask" | "train"}. The engine uses this to build loss_mask:
"mask"-- tokens in this span are ignored during training (loss_mask=0)"train"-- tokens in this span contribute to the loss (loss_mask=1)
For chat mode, keys are role names (system, user, assistant, ...).
For instruction mode, keys are "prompt" and "response".
| Field | Type | Default | Description |
|---|---|---|---|
mask |
dict | {} |
Role/field to action mapping |
mask_default |
string | "mask" |
Default action for unlisted roles |
preprocessing
| Field | Type | Default | Description |
|---|---|---|---|
max_seq_len |
int | 2048 |
Maximum token length; truncated if exceeded |
min_chars |
int | 50 |
Minimum character length; dropped if shorter (text mode only) |
max_chars |
int | 2000000 |
Maximum character length; dropped if longer (text mode only) |
deduplicate |
bool | true |
Remove exact duplicates via MD5 of first 200 chars |
max_items |
int or null | null |
Maximum items to process; null = unlimited |
output
| Field | Type | Default | Description |
|---|---|---|---|
domain_key |
string or null | null |
JSON key for domain grouping; null = all output to __default__ |
storage_format |
string | "bin" |
"bin" (mmap, zero-copy) or "h5" (HDF5) |
max_tokens_per_shard |
int | 100000000 |
Max tokens per output shard |
Mask Algorithm
Chat Mode (role-span tracking)
For each message in the messages array:
- Prepend BOS token (position 0, always masked)
- Render through the chat template for that single message
- Encode the rendered text, record token span
(start, end, role) - Concatenate all spans — special tokens from the chat template naturally prevent BPE merging across message boundaries
- Fill
loss_maskfrom the mask rules
Multi-turn example:
Data:
[system: "You are helpful."]
[user: "What is 2+2?"]
[assistant: "4"]
[user: "What is 3+3?"]
[assistant: "6"]
Config:
"mask": {"system": "mask", "user": "mask", "assistant": "train"}
Result:
tokens: <bos> [system span] [user span] [assistant:4 span] [user span] [assistant:6 span]
mask: 0 0 0 1 0 1
Both assistant turns are trained. All system and user tokens are masked.
Instruction Mode (field boundary)
Encode the prompt and response fields independently, then split the mask at the field boundary.
"prompt": "mask", "response": "train"-- mask the left half, train the right half"prompt": "train", "response": "mask"-- the reverse
Text Mode (no mask)
Pure tokenization. No loss_mask is produced. Used for pretraining.
Output Layout
Single-Shard (bin)
output_dir/
__default__/ # when domain_key is null
meta.json # {"sequence": {"shape": [N], "dtype": "int64"}, ...}
sequence.bin # int64 raw bytes, mmap-able for zero-copy reads
loss_mask.bin # int64 raw bytes
wiki/ # when domain_key="source" and item["source"]="wiki"
meta.json
sequence.bin
loss_mask.bin
Multi-Shard (bin)
When max_tokens_per_shard is exceeded, bin output is split into numbered shard subdirectories:
output_dir/
__default__/
shard_0000/
meta.json
sequence.bin
loss_mask.bin
shard_0001/
meta.json
sequence.bin
loss_mask.bin
MmapStore automatically discovers and merges all shards under the domain directory.
H5 Output
HDF5 files are always named with a shard index, avoiding overwrite regardless of max_tokens_per_shard:
output_dir/
__default__/
data_0000.h5 # each H5 contains key→dataset groups
data_0001.h5
wiki/
data_0000.h5
Python API Usage
from astrai.preprocessing.pipeline import Pipeline
from astrai.config.preprocess_config import PipelineConfig
config = PipelineConfig.from_json("sft_pipeline.json")
Pipeline(
config,
["data_part1.jsonl", "data_part2.jsonl"],
output_dir="output/",
tokenizer_path="params"
).run()
Or from the CLI:
python scripts/tools/preprocess.py data/*.jsonl -o output/ -c sft.json
Extension
Register a custom builder for new formats:
from astrai.preprocessing.builder import BaseMaskBuilder, MaskBuilderFactory
@MaskBuilderFactory.register("my_format")
class MyFormatBuilder(BaseMaskBuilder):
def build(self, item: dict, config, tokenizer) -> dict | None:
# Return {"ids": [...], "loss_mask": [...], "domain": "..."}
# Return None to skip this item
...
Then set "input": {"type": "my_format"} in your config.
Compared to Old Pipeline
Old (astrai.preprocess.Pipeline) |
New (astrai.preprocessing.pipeline.Pipeline) |
|---|---|
| Configured via constructor arguments | Configured via JSON file |
Hardcoded _transform_chat / _transform_text |
Factory-registered Builder with declarative mask rules |
| Auto-detects format via magic key lists | Explicit input.type declaration |
| Double-encodes (full + prompt), uses length diff for mask | Single-encode with role-span tracking |
| Only trains the last assistant turn | Configurable: multi-turn, single-turn, or no mask |
Document Update Time: 2026-05-30