7.8 KiB

Raw Blame History

Preprocessing Pipeline

Declarative JSON-driven data preprocessing. No code needed -- describe your input format and mask rules in a config file, the engine does the rest.

Philosophy

Component	Responsibility
`tokenizer_config.json` (`chat_template`)	Formatting -- how roles become tokens
`pipeline.json` (`mask`)	Masking -- which roles participate in training

The two are fully decoupled. A single config file captures the entire pipeline, reusable and version-controllable. Extension is via factory registration (@MaskBuilderFactory.register) -- no need to touch existing code.

Quick Start

SFT Chat

{
  "version": 1,
  "input": {
    "type": "chat",
    "messages_key": "messages"
  },
  "mask": {
    "system": "mask",
    "user": "mask",
    "assistant": "train"
  },
  "mask_default": "mask",
  "preprocessing": {
    "max_seq_len": 2048,
    "deduplicate": true
  },
  "output": {
    "domain_key": "source",
    "storage_format": "bin",
    "max_tokens_per_shard": 100000000
  }
}

Three lines of mask rules cover the most common SFT case: train on assistant turns, mask everything else.

Instruction Tuning

{
  "version": 1,
  "input": {
    "type": "instruction",
    "prompt_key": "instruction",
    "response_key": "output"
  },
  "mask": {
    "prompt": "mask",
    "response": "train"
  },
  "mask_default": "mask",
  "preprocessing": {
    "max_seq_len": 2048
  },
  "output": {
    "storage_format": "bin"
  }
}

Mask splits at the prompt/response field boundary.

Pretraining

{
  "version": 1,
  "input": {
    "type": "text",
    "text_key": "content"
  },
  "mask": {},
  "preprocessing": {
    "max_seq_len": 2048,
    "min_chars": 50
  },
  "output": {
    "storage_format": "bin"
  }
}

No mask -- train on all tokens.

Run

python scripts/tools/preprocess.py data/*.jsonl -o output/ -c sft.json

Configuration Reference

`input`

Field	Type	Required	Default	Description
`type`	string	yes	`"chat"`	Format: `"chat"`, `"instruction"`, or `"text"`
`messages_key`	string	no	`"messages"`	JSON key for messages array (chat)
`prompt_key`	string	no	`"prompt"`	JSON key for prompt field (instruction)
`response_key`	string	no	`"response"`	JSON key for response field (instruction)
`text_key`	string	no	`"text"`	JSON key for text field

`mask`

A map of {role_or_field: "mask" | "train"}. The engine uses this to build loss_mask:

"mask" -- tokens in this span are ignored during training (loss_mask=0)
"train" -- tokens in this span contribute to the loss (loss_mask=1)

For chat mode, keys are role names (system, user, assistant, ...). For instruction mode, keys are "prompt" and "response".

Field	Type	Default	Description
`mask`	dict	`{}`	Role/field to action mapping
`mask_default`	string	`"mask"`	Default action for unlisted roles

`preprocessing`

Field	Type	Default	Description
`max_seq_len`	int	`2048`	Maximum token length; truncated if exceeded
`min_chars`	int	`50`	Minimum character length; dropped if shorter (text mode only)
`max_chars`	int	`2000000`	Maximum character length; dropped if longer (text mode only)
`deduplicate`	bool	`true`	Remove exact duplicates via MD5 of first 200 chars
`max_items`	int or null	`null`	Maximum items to process; `null` = unlimited

`output`

Field	Type	Default	Description
`domain_key`	string or null	`null`	JSON key for domain grouping; `null` = all output to `__default__`
`storage_format`	string	`"bin"`	`"bin"` (mmap, zero-copy) or `"h5"` (HDF5)
`max_tokens_per_shard`	int	`100000000`	Max tokens per output shard

Mask Algorithm

Chat Mode (role-span tracking)

For each message in the messages array:

Prepend BOS token (position 0, always masked)
Render through the chat template for that single message
Encode the rendered text, record token span (start, end, role)
Concatenate all spans — special tokens from the chat template naturally prevent BPE merging across message boundaries
Fill loss_mask from the mask rules

Multi-turn example:

Data:
  [system: "You are helpful."]
  [user: "What is 2+2?"]
  [assistant: "4"]
  [user: "What is 3+3?"]
  [assistant: "6"]

Config:
  "mask": {"system": "mask", "user": "mask", "assistant": "train"}

Result:
  tokens:  <bos> [system span] [user span] [assistant:4 span] [user span] [assistant:6 span]
  mask:      0       0            0              1               0             1

Both assistant turns are trained. All system and user tokens are masked.

Instruction Mode (field boundary)

Encode the prompt and response fields independently, then split the mask at the field boundary.

"prompt": "mask", "response": "train" -- mask the left half, train the right half
"prompt": "train", "response": "mask" -- the reverse

Text Mode (no mask)

Pure tokenization. No loss_mask is produced. Used for pretraining.

Output Layout

Single-Shard (`bin`)

output_dir/
  __default__/              # when domain_key is null
    meta.json               # {"sequence": {"shape": [N], "dtype": "int64"}, ...}
    sequence.bin            # int64 raw bytes, mmap-able for zero-copy reads
    loss_mask.bin           # int64 raw bytes
  wiki/                     # when domain_key="source" and item["source"]="wiki"
    meta.json
    sequence.bin
    loss_mask.bin

Multi-Shard (`bin`)

When max_tokens_per_shard is exceeded, bin output is split into numbered shard subdirectories:

output_dir/
  __default__/
    shard_0000/
      meta.json
      sequence.bin
      loss_mask.bin
    shard_0001/
      meta.json
      sequence.bin
      loss_mask.bin

MmapStore automatically discovers and merges all shards under the domain directory.

H5 Output

HDF5 files are always named with a shard index, avoiding overwrite regardless of max_tokens_per_shard:

output_dir/
  __default__/
    data_0000.h5            # each H5 contains key→dataset groups
    data_0001.h5
  wiki/
    data_0000.h5

Python API Usage

from astrai.preprocessing.pipeline import Pipeline
from astrai.config.preprocess_config import PipelineConfig

config = PipelineConfig.from_json("sft_pipeline.json")
Pipeline(
    config,
    ["data_part1.jsonl", "data_part2.jsonl"],
    output_dir="output/",
    tokenizer_path="params"
).run()

Or from the CLI:

python scripts/tools/preprocess.py data/*.jsonl -o output/ -c sft.json

Extension

from astrai.preprocessing.builder import BaseMaskBuilder, MaskBuilderFactory

@MaskBuilderFactory.register("my_format")
class MyFormatBuilder(BaseMaskBuilder):
    def build(self, item: dict, config, tokenizer) -> dict | None:
        # Return {"ids": [...], "loss_mask": [...], "domain": "..."}
        # Return None to skip this item
        ...

Then set "input": {"type": "my_format"} in your config.

Compared to Old Pipeline

Old (`astrai.preprocess.Pipeline`)	New (`astrai.preprocessing.pipeline.Pipeline`)
Configured via constructor arguments	Configured via JSON file
Hardcoded `_transform_chat` / `_transform_text`	Factory-registered `Builder` with declarative mask rules
Auto-detects format via magic key lists	Explicit `input.type` declaration
Double-encodes (full + prompt), uses length diff for mask	Single-encode with role-span tracking
Only trains the last assistant turn	Configurable: multi-turn, single-turn, or no mask

Document Update Time: 2026-05-30

7.8 KiB Raw Blame History