AstrAI/assets/docs/preprocessing.md

6.7 KiB

Preprocessing Pipeline

Declarative JSON-driven data preprocessing. No code needed -- describe your input format and mask rules in a config file, the engine does the rest.

Philosophy

Component Responsibility
tokenizer_config.json (chat_template) Formatting -- how roles become tokens
pipeline.json (mask) Masking -- which roles participate in training

The two are fully decoupled. A single config file captures the entire pipeline, reusable and version-controllable. Extension is via factory registration (@MaskBuilderFactory.register) -- no need to touch existing code.

Quick Start

SFT Chat

{
  "version": 1,
  "input": {
    "type": "chat",
    "messages_key": "messages"
  },
  "mask": {
    "system": "mask",
    "user": "mask",
    "assistant": "train"
  },
  "mask_default": "mask",
  "preprocessing": {
    "max_seq_len": 2048,
    "deduplicate": true
  },
  "output": {
    "domain_key": "source",
    "storage_format": "bin",
    "max_tokens_per_shard": 100000000
  }
}

Three lines of mask rules cover the most common SFT case: train on assistant turns, mask everything else.

Instruction Tuning

{
  "version": 1,
  "input": {
    "type": "instruction",
    "prompt_key": "instruction",
    "response_key": "output"
  },
  "mask": {
    "prompt": "mask",
    "response": "train"
  },
  "mask_default": "mask",
  "preprocessing": {
    "max_seq_len": 2048
  },
  "output": {
    "storage_format": "bin"
  }
}

Mask splits at the prompt/response field boundary.

Pretraining

{
  "version": 1,
  "input": {
    "type": "text",
    "text_key": "content"
  },
  "mask": {},
  "preprocessing": {
    "max_seq_len": 2048,
    "min_chars": 50
  },
  "output": {
    "storage_format": "bin"
  }
}

No mask -- train on all tokens.

Run

python scripts/tools/preprocess.py data/*.jsonl -o output/ -c sft.json

Configuration Reference

input

Field Type Required Default Description
type string yes "chat" Format: "chat", "instruction", or "text"
messages_key string no "messages" JSON key for messages array (chat)
prompt_key string no "prompt" JSON key for prompt field (instruction)
response_key string no "response" JSON key for response field (instruction)
text_key string no "text" JSON key for text field

mask

A map of {role_or_field: "mask" | "train"}. The engine uses this to build loss_mask:

  • "mask" -- tokens in this span are ignored during training (loss_mask=0)
  • "train" -- tokens in this span contribute to the loss (loss_mask=1)

For chat mode, keys are role names (system, user, assistant, ...). For instruction mode, keys are "prompt" and "response".

Field Type Default Description
mask dict {} Role/field to action mapping
mask_default string "mask" Default action for unlisted roles

preprocessing

Field Type Default Description
max_seq_len int 2048 Maximum token length; truncated if exceeded
min_chars int 50 Minimum character length; dropped if shorter (text mode only)
max_chars int 2000000 Maximum character length; dropped if longer (text mode only)
deduplicate bool true Remove exact duplicates via MD5 of first 200 chars
max_items int or null null Maximum items to process; null = unlimited

output

Field Type Default Description
domain_key string or null null JSON key for domain grouping; null = all output to __default__
storage_format string "bin" "bin" (mmap, zero-copy) or "h5" (HDF5)
max_tokens_per_shard int 100000000 Max tokens per output shard

Mask Algorithm

Chat Mode (role-span tracking)

For each message in the messages array:

  1. Prepend BOS token (position 0, always masked)
  2. Render through the chat template for that single message
  3. Encode the rendered text, record token span (start, end, role)
  4. Concatenate all spans — special tokens from the chat template naturally prevent BPE merging across message boundaries
  5. Fill loss_mask from the mask rules

Multi-turn example:

Data:
  [system: "You are helpful."]
  [user: "What is 2+2?"]
  [assistant: "4"]
  [user: "What is 3+3?"]
  [assistant: "6"]

Config:
  "mask": {"system": "mask", "user": "mask", "assistant": "train"}

Result:
  tokens:  <bos> [system span] [user span] [assistant:4 span] [user span] [assistant:6 span]
  mask:      0       0            0              1               0             1

Both assistant turns are trained. All system and user tokens are masked.

Instruction Mode (field boundary)

Encode the prompt and response fields independently, then split the mask at the field boundary.

  • "prompt": "mask", "response": "train" -- mask the left half, train the right half
  • "prompt": "train", "response": "mask" -- the reverse

Text Mode (no mask)

Pure tokenization. No loss_mask is produced. Used for pretraining.

Output Layout

output_dir/
  __default__/              # when domain_key is null
    meta.json               # {"sequence": {"shape": [N], "dtype": "int64"}, ...}
    sequence.bin            # int64 raw bytes, mmap-able for zero-copy reads
    loss_mask.bin           # int64 raw bytes
  wiki/                     # when domain_key="source" and item["source"]="wiki"
    meta.json
    sequence.bin
    loss_mask.bin

Extension

Register a custom builder for new formats:

from astrai.preprocessing.builder import BaseMaskBuilder, MaskBuilderFactory

@MaskBuilderFactory.register("my_format")
class MyFormatBuilder(BaseMaskBuilder):
    def build(self, item: dict, config, tokenizer) -> dict | None:
        # Return {"ids": [...], "loss_mask": [...], "domain": "..."}
        # Return None to skip this item
        ...

Then set "input": {"type": "my_format"} in your config.

Compared to Old Pipeline

Old (astrai.preprocess.Pipeline) New (astrai.preprocessing.pipeline.Pipeline)
Configured via constructor arguments Configured via JSON file
Hardcoded _transform_chat / _transform_text Factory-registered Builder with declarative mask rules
Auto-detects format via magic key lists Explicit input.type declaration
Double-encodes (full + prompt), uses length diff for mask Single-encode with role-span tracking
Only trains the last assistant turn Configurable: multi-turn, single-turn, or no mask

Document Update Time: 2026-05-30