refactor : BaseConfig 提供 from_json/to_json，嵌套 config 自动反序列化

- from_json/to_json 上提至 BaseConfig，所有子类自动继承 - _coerce 新增 dict 到 BaseConfig 子类的递归反序列化，消除子类 from_dict 重载 - PipelineConfig 等子类仅声明字段，零样板代码 - 测试 tokenizer 改为自包含 BPE（含 chat template），不依赖 params/ 目录 - 特殊 token 改用 ASCII 字符，兼容所有平台
refactor : 基于声明式 JSON 配置的预处理管线重构
2026-05-30 21:04:19 +08:00 · 2026-05-30 20:45:09 +08:00 · 2026-05-30 17:12:42 +08:00 · 2026-05-30 16:51:24 +08:00 · 2026-05-29 21:57:44 +08:00 · 2026-05-29 21:12:52 +08:00
22 changed files with 1329 additions and 95 deletions
--- a/assets/docs/preprocessing.md
+++ b/assets/docs/preprocessing.md
@ -0,0 +1,227 @@
+# Preprocessing Pipeline
+
+Declarative JSON-driven data preprocessing. No code needed -- describe your input format and mask rules in a config file, the engine does the rest.
+
+## Philosophy
+
+| Component | Responsibility |
+|-----------|---------------|
+| `tokenizer_config.json` (`chat_template`) | Formatting -- how roles become tokens |
+| `pipeline.json` (`mask`) | Masking -- which roles participate in training |
+
+The two are fully decoupled. A single config file captures the entire pipeline, reusable and version-controllable. Extension is via factory registration (`@MaskBuilderFactory.register`) -- no need to touch existing code.
+
+## Quick Start
+
+### SFT Chat
+
+```json
+{
+  "version": 1,
+  "input": {
+    "type": "chat",
+    "messages_key": "messages"
+  },
+  "mask": {
+    "system": "mask",
+    "user": "mask",
+    "assistant": "train"
+  },
+  "mask_default": "mask",
+  "preprocessing": {
+    "max_seq_len": 2048,
+    "deduplicate": true
+  },
+  "output": {
+    "domain_key": "source",
+    "storage_format": "bin",
+    "max_tokens_per_shard": 100000000
+  }
+}
+```
+
+Three lines of mask rules cover the most common SFT case: train on assistant turns, mask everything else.
+
+### Instruction Tuning
+
+```json
+{
+  "version": 1,
+  "input": {
+    "type": "instruction",
+    "prompt_key": "instruction",
+    "response_key": "output"
+  },
+  "mask": {
+    "prompt": "mask",
+    "response": "train"
+  },
+  "mask_default": "mask",
+  "preprocessing": {
+    "max_seq_len": 2048
+  },
+  "output": {
+    "storage_format": "bin"
+  }
+}
+```
+
+Mask splits at the prompt/response field boundary.
+
+### Pretraining
+
+```json
+{
+  "version": 1,
+  "input": {
+    "type": "text",
+    "text_key": "content"
+  },
+  "mask": {},
+  "preprocessing": {
+    "max_seq_len": 2048,
+    "min_chars": 50
+  },
+  "output": {
+    "storage_format": "bin"
+  }
+}
+```
+
+No mask -- train on all tokens.
+
+### Run
+
+```bash
+python scripts/tools/preprocess.py data/*.jsonl -o output/ -c sft.json
+```
+
+## Configuration Reference
+
+### `input`
+
+| Field | Type | Required | Default | Description |
+|-------|------|----------|---------|-------------|
+| `type` | string | yes | `"chat"` | Format: `"chat"`, `"instruction"`, or `"text"` |
+| `messages_key` | string | no | `"messages"` | JSON key for messages array (chat) |
+| `prompt_key` | string | no | `"prompt"` | JSON key for prompt field (instruction) |
+| `response_key` | string | no | `"response"` | JSON key for response field (instruction) |
+| `text_key` | string | no | `"text"` | JSON key for text field |
+
+### `mask`
+
+A map of `{role_or_field: "mask" | "train"}`. The engine uses this to build `loss_mask`:
+
+- `"mask"` -- tokens in this span are ignored during training (`loss_mask=0`)
+- `"train"` -- tokens in this span contribute to the loss (`loss_mask=1`)
+
+For chat mode, keys are role names (`system`, `user`, `assistant`, ...).
+For instruction mode, keys are `"prompt"` and `"response"`.
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `mask` | dict | `{}` | Role/field to action mapping |
+| `mask_default` | string | `"mask"` | Default action for unlisted roles |
+
+### `preprocessing`
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `max_seq_len` | int | `2048` | Maximum token length; truncated if exceeded |
+| `min_chars` | int | `50` | Minimum character length; dropped if shorter (text mode only) |
+| `max_chars` | int | `2000000` | Maximum character length; dropped if longer (text mode only) |
+| `deduplicate` | bool | `true` | Remove exact duplicates via MD5 of first 200 chars |
+| `max_items` | int or null | `null` | Maximum items to process; `null` = unlimited |
+
+### `output`
+
+| Field | Type | Default | Description |
+|-------|------|---------|-------------|
+| `domain_key` | string or null | `null` | JSON key for domain grouping; `null` = all output to `__default__` |
+| `storage_format` | string | `"bin"` | `"bin"` (mmap, zero-copy) or `"h5"` (HDF5) |
+| `max_tokens_per_shard` | int | `100000000` | Max tokens per output shard |
+
+## Mask Algorithm
+
+### Chat Mode (role-span tracking)
+
+For each message in the `messages` array:
+
+1. Render through the chat template for that single message
+2. Encode the rendered text, record token span `(start, end, role)`
+3. Concatenate all spans -- special tokens from the chat template naturally prevent BPE merging across message boundaries
+4. Fill `loss_mask` from the mask rules
+
+**Multi-turn example**:
+
+```
+Data:
+  [system: "You are helpful."]
+  [user: "What is 2+2?"]
+  [assistant: "4"]
+  [user: "What is 3+3?"]
+  [assistant: "6"]
+
+Config:
+  "mask": {"system": "mask", "user": "mask", "assistant": "train"}
+
+Result:
+  tokens:  <bos> [system span] [user span] [assistant:4 span] [user span] [assistant:6 span]
+  mask:      0       0            0              1               0             1
+```
+
+Both assistant turns are trained. All system and user tokens are masked.
+
+### Instruction Mode (field boundary)
+
+Encode the prompt and response fields independently, then split the mask at the field boundary.
+
+- `"prompt": "mask", "response": "train"` -- mask the left half, train the right half
+- `"prompt": "train", "response": "mask"` -- the reverse
+
+### Text Mode (no mask)
+
+Pure tokenization. No `loss_mask` is produced. Used for pretraining.
+
+## Output Layout
+
+```
+output_dir/
+  __default__/              # when domain_key is null
+    meta.json               # {"sequence": {"shape": [N], "dtype": "int64"}, ...}
+    sequence.bin            # int64 raw bytes, mmap-able for zero-copy reads
+    loss_mask.bin           # int64 raw bytes
+  wiki/                     # when domain_key="source" and item["source"]="wiki"
+    meta.json
+    sequence.bin
+    loss_mask.bin
+```
+
+## Extension
+
+Register a custom builder for new formats:
+
+```python
+from astrai.preprocessing.builder import BaseMaskBuilder, MaskBuilderFactory
+
+@MaskBuilderFactory.register("my_format")
+class MyFormatBuilder(BaseMaskBuilder):
+    def build(self, item: dict, config, tokenizer) -> dict | None:
+        # Return {"ids": [...], "loss_mask": [...], "domain": "..."}
+        # Return None to skip this item
+        ...
+```
+
+Then set `"input": {"type": "my_format"}` in your config.
+
+## Compared to Old Pipeline
+
+| Old (`astrai.preprocess.Pipeline`) | New (`astrai.preprocessing.pipeline.Pipeline`) |
+|---|---|
+| Configured via constructor arguments | Configured via JSON file |
+| Hardcoded `_transform_chat` / `_transform_text` | Factory-registered `Builder` with declarative mask rules |
+| Auto-detects format via magic key lists | Explicit `input.type` declaration |
+| Double-encodes (full + prompt), uses length diff for mask | Single-encode with role-span tracking |
+| Only trains the last assistant turn | Configurable: multi-turn, single-turn, or no mask |
+
+> Document Update Time: 2026-05-30
--- a/assets/docs/training.md
+++ b/assets/docs/training.md
@ -1,38 +1,5 @@
 # Training

-## Model Architecture
-
-The model uses a decoder-only Transformer with **GQA** (Grouped Query Attention) and optional **MLA** (Multi-head Latent Attention). 1.0 billion parameters, Chinese–English bilingual.
-
-```mermaid
-flowchart TB
-    subgraph Layers["Transformer Layers"]
-        direction TB
-        A[Input Embedding] --> B[Transformer Block\nLayer 1]
-        B --> C[Transformer Block\nLayer ...]
-        C --> D[Transformer Block\nLayer ...]
-        D --> E[RMSNorm]
-        E --> F[Linear]
-        F --> G[SoftMax]
-    end
-
-    subgraph TransformerBlock["Transformer Block"]
-        direction TB
-        H[x] --> I[RMSNorm]
-        I --> J[Linear → Q/K/V]
-        J --> K[Q]; J --> L[K]; J --> M[V]
-        K --> N[RoPE]; L --> O[RoPE]
-        N --> P["Q @ K^T / sqrt(d)"]; O --> P
-        P --> Q[Masked SoftMax]; Q --> R[S @ V]; M --> R
-        R --> S[Linear]; S --> T[+]; H --> T
-        T --> U[RMSNorm]
-        U --> V["Linear (gate)"]; U --> W["Linear (up)"]
-        V --> X[SiLU]; X --> Y[×]; W --> Y
-        Y --> Z["Linear (down)"]; Z --> AA[+]; T --> AA
-        AA --> BB[x']
-    end
-```
-
 ### Autoregression

 Given a token sequence, the model predicts the probability of the next token. Each generated token is appended to the input and fed back, repeating until an end-of-sequence token or max length.
--- a/astrai/init.py
+++ b/astrai/init.py
@ -1,4 +1,4 @@
-__version__ = "1.3.6"
+__version__ = "1.3.7"
 __author__ = "ViperEkura"

 from astrai.config import (
--- a/astrai/config/init.py
+++ b/astrai/config/init.py
@ -4,13 +4,22 @@ from astrai.config.model_config import (
    ConfigFactory,
    EncoderConfig,
 )
+from astrai.config.preprocess_config import (
+    InputConfig,
+    OutputConfig,
+    PipelineConfig,
+    ProcessingConfig,
+)
 from astrai.config.train_config import TrainConfig

 __all__ = [
-    # Model configuration
    "BaseModelConfig",
    "AutoRegressiveLMConfig",
    "EncoderConfig",
    "ConfigFactory",
    "TrainConfig",
+    "InputConfig",
+    "OutputConfig",
+    "PipelineConfig",
+    "ProcessingConfig",
 ]
--- a/astrai/config/base.py
+++ b/astrai/config/base.py
@ -1,6 +1,7 @@
 import json
 from dataclasses import MISSING, dataclass, fields
-from typing import Any, Dict, Optional, Self, get_type_hints
+from pathlib import Path
+from typing import Any, Dict, Optional, Self, Union, get_type_hints


@dataclass
@ -83,4 +84,15 @@ class BaseConfig:
            return value
        if isinstance(value, target_type):
            return value
+        if isinstance(value, dict) and issubclass(target_type, BaseConfig):
+            return target_type.from_dict(value)
        raise TypeError
+
+    @classmethod
+    def from_json(cls, path: Union[str, Path]) -> Self:
+        with open(path, "r", encoding="utf-8") as f:
+            return cls.from_dict(json.load(f))
+
+    def to_json(self, path: Union[str, Path]):
+        with open(path, "w", encoding="utf-8") as f:
+            json.dump(self.to_dict(), f, indent=2, ensure_ascii=False)
--- a/astrai/config/preprocess_config.py
+++ b/astrai/config/preprocess_config.py
@ -0,0 +1,43 @@
+"""Pipeline configuration for JSONL preprocessing."""
+
+from __future__ import annotations
+
+from dataclasses import dataclass, field
+from typing import Dict, Optional
+
+from astrai.config.base import BaseConfig
+
+
+@dataclass
+class InputConfig(BaseConfig):
+    type: str = "chat"
+    messages_key: str = "messages"
+    prompt_key: str = "prompt"
+    response_key: str = "response"
+    text_key: str = "text"
+
+
+@dataclass
+class ProcessingConfig(BaseConfig):
+    max_seq_len: int = 2048
+    min_chars: int = 50
+    max_chars: int = 2_000_000
+    deduplicate: bool = True
+    max_items: Optional[int] = None
+
+
+@dataclass
+class OutputConfig(BaseConfig):
+    domain_key: Optional[str] = None
+    storage_format: str = "bin"
+    max_tokens_per_shard: int = 100_000_000
+
+
+@dataclass
+class PipelineConfig(BaseConfig):
+    version: int = 1
+    input: InputConfig = field(default_factory=InputConfig)
+    mask: Dict[str, str] = field(default_factory=dict)
+    mask_default: str = "mask"
+    preprocessing: ProcessingConfig = field(default_factory=ProcessingConfig)
+    output: OutputConfig = field(default_factory=OutputConfig)
--- a/astrai/inference/api/protocol.py
+++ b/astrai/inference/api/protocol.py
@ -138,13 +138,13 @@ class ProtocolHandler:
            yielded = ""
            matched = None
            async for token in agen:
-                ctx.completion_tokens += 1
                body += token

                matched = checker.check(body)
                if matched:
                    break

+                ctx.completion_tokens += 1
                yield self.builder.format_chunk(token)
                yielded += token

@ -168,7 +168,6 @@ class ProtocolHandler:
        matched = None

        async for token in agen:
-            ctx.completion_tokens += 1
            chunks.append(token)
            body += token

@ -176,6 +175,8 @@ class ProtocolHandler:
            if matched:
                break

+            ctx.completion_tokens += 1
+
        content = "".join(chunks)
        stop = StopInfo(matched=matched, body=body)
        return self.builder.format_response(ctx, content, stop)
--- a/astrai/inference/core/scheduler.py
+++ b/astrai/inference/core/scheduler.py
@ -71,6 +71,7 @@ class InferenceScheduler:
        )

        self._running = False
+        self._fatal_error: Optional[Exception] = None

    def add_task(self, prompt: str, **kwargs) -> str:
        return self._task_mgr.add_task(prompt, **kwargs)
@ -175,6 +176,8 @@ class InferenceScheduler:
                                    t.stream_callback(STOP)

        except Exception as e:
+            self._fatal_error = e
+            self._running = False
            logger.error(f"Scheduler loop crashed: {e}", exc_info=True)
            for task in self._task_mgr.get_active_tasks():
                if task.stream_callback:
@ -184,7 +187,6 @@ class InferenceScheduler:
                if task.stream_callback:
                    task.stream_callback(STOP)
            self._task_mgr.clear_queues()
-            raise

    def start(self):
        if not self._running:
@ -199,7 +201,12 @@ class InferenceScheduler:
        if hasattr(self, "_loop_thread"):
            self._loop_thread.join(timeout=2.0)
        for task in self._task_mgr.get_active_tasks():
+            if task.stream_callback:
+                task.stream_callback(STOP)
            self._page_cache.task_free(task.task_id)
+        for task in self._task_mgr.get_waiting_tasks():
+            if task.stream_callback:
+                task.stream_callback(STOP)
        self._task_mgr.clear_queues()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
--- a/astrai/inference/core/task.py
+++ b/astrai/inference/core/task.py
@ -186,7 +186,10 @@ class TaskManager:
        return bool(self.active_tasks or self.waiting_queue)

    def wait_for_tasks(self, timeout: float = 1.0):
-        self._task_event.clear()
+        with self._lock:
+            if self.waiting_queue or self.active_tasks:
+                return
+            self._task_event.clear()
        self._task_event.wait(timeout=timeout)

    def get_active_tasks(self) -> List[Task]:
--- a/astrai/inference/engine.py
+++ b/astrai/inference/engine.py
@ -79,8 +79,8 @@ class GenerationRequest:
            raise ValueError("top_k must be a non-negative integer")
        if not (0.0 <= top_p <= 1.0):
            raise ValueError("top_p must be a float between 0.0 and 1.0")
-        if not (isinstance(temperature, (int, float)) and temperature >= 0):
-            raise ValueError("temperature must be a non-negative number")
+        if not (isinstance(temperature, (int, float)) and temperature > 0):
+            raise ValueError("temperature must be a positive number")

        self.messages = messages
        self.top_k = top_k
--- a/astrai/inference/sample.py
+++ b/astrai/inference/sample.py
@ -44,10 +44,12 @@ class TemperatureStrategy(BaseSamplingStrategy):
    def apply(self, logits, filter_value=-float("inf")):
        t = self.temperature
        if isinstance(t, Tensor):
+            t = t.to(logits.device, non_blocking=True).view(-1, 1)
+            t = torch.clamp(t, min=1e-8)
            if (t != 1.0).any():
-                logits = logits / t.to(logits.device, non_blocking=True).view(-1, 1)
+                logits = logits / t
        elif t != 1.0:
-            logits = logits / t
+            logits = logits / max(t, 1e-8)
        return logits


--- a/astrai/parallel/executor.py
+++ b/astrai/parallel/executor.py
@ -7,6 +7,7 @@ from typing import Optional, Tuple

 import torch
 import torch.nn as nn
+from torch.distributed.fsdp import FullStateDictConfig, StateDictType
 from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
 from torch.nn.parallel import DistributedDataParallel as DDP
 from torch.optim import Optimizer
@ -115,8 +116,8 @@ class BaseExecutor:
    def backward(self, loss: torch.Tensor):
        loss.backward()

-    def unwrap_model(self, model: nn.Module) -> nn.Module:
-        return model
+    def unwrap_model(self, model: nn.Module):
+        return model.state_dict()

    @property
    def use_distributed(self) -> bool:
@ -195,10 +196,10 @@ class DDPExecutor(BaseExecutor):
            return model.no_sync()
        return contextlib.nullcontext()

-    def unwrap_model(self, model: nn.Module) -> nn.Module:
+    def unwrap_model(self, model: nn.Module):
        if isinstance(model, DDP):
-            return model.module
-        return model
+            return model.module.state_dict()
+        return model.state_dict()


@ExecutorFactory.register("fsdp")
@ -217,7 +218,6 @@ class FSDPExecutor(BaseExecutor):
        sync_module_states: bool = False,
        forward_prefetch: bool = False,
        limit_all_gathers: bool = True,
-        use_orig_params: bool = False,
        ignored_states=None,
        device_mesh=None,
    ):
@ -236,7 +236,7 @@ class FSDPExecutor(BaseExecutor):
                sync_module_states=sync_module_states,
                forward_prefetch=forward_prefetch,
                limit_all_gathers=limit_all_gathers,
-                use_orig_params=use_orig_params,
+                use_orig_params=True,
                ignored_states=ignored_states,
                device_mesh=device_mesh,
            ).items()
@ -259,9 +259,13 @@ class FSDPExecutor(BaseExecutor):
            return model.no_sync()
        return contextlib.nullcontext()

-    def unwrap_model(self, model: nn.Module) -> nn.Module:
-        if self._original_model is not None:
-            return self._original_model
-        if isinstance(model, FSDP):
-            return model._fsdp_wrapped_module
-        return model
+    def unwrap_model(self, model: nn.Module):
+        if isinstance(model, FSDP) and self.use_distributed:
+            with FSDP.state_dict_type(
+                model,
+                StateDictType.FULL_STATE_DICT,
+                FullStateDictConfig(offload_to_cpu=True, rank0_only=False),
+            ):
+                return model.state_dict()
+
+        return model.state_dict()
--- a/astrai/preprocessing/init.py
+++ b/astrai/preprocessing/init.py
@ -0,0 +1,19 @@
+from astrai.preprocessing.builder import (
+    BaseMaskBuilder,
+    ChatMaskBuilder,
+    InstructionMaskBuilder,
+    MaskBuilderFactory,
+    TextMaskBuilder,
+)
+from astrai.preprocessing.pipeline import Pipeline, dedup_signature, filter_by_length
+
+__all__ = [
+    "BaseMaskBuilder",
+    "ChatMaskBuilder",
+    "InstructionMaskBuilder",
+    "MaskBuilderFactory",
+    "TextMaskBuilder",
+    "Pipeline",
+    "dedup_signature",
+    "filter_by_length",
+]
--- a/astrai/preprocessing/builder.py
+++ b/astrai/preprocessing/builder.py
@ -0,0 +1,161 @@
+"""Mask building strategies for preprocessing pipeline.
+
+Each builder knows how to tokenize one input format and construct
+the loss_mask according to declarative mask rules from the config.
+"""
+
+from __future__ import annotations
+
+from abc import ABC, abstractmethod
+from typing import List, Optional
+
+from astrai.factory import BaseFactory
+
+
+class BaseMaskBuilder(ABC):
+    """Convert a JSONL item into token ids and optional loss_mask."""
+
+    @abstractmethod
+    def build(self, item: dict, config, tokenizer) -> Optional[dict]:
+        """Build ``{ids, loss_mask?, domain}`` from a JSONL record.
+
+        Returns ``None`` to skip the item entirely.
+        """
+        ...
+
+
+class MaskBuilderFactory(BaseFactory["BaseMaskBuilder"]):
+    @classmethod
+    def _validate_component(cls, component_cls: type):
+        if not issubclass(component_cls, BaseMaskBuilder):
+            raise TypeError(
+                f"{component_cls.__name__} must inherit from BaseMaskBuilder"
+            )
+
+
+def _extract_domain(item: dict, domain_key: Optional[str]) -> str:
+    if not domain_key:
+        return "__default__"
+    val = item.get(domain_key, "__default__")
+    return val if isinstance(val, str) else "__default__"
+
+
+@MaskBuilderFactory.register("chat")
+class ChatMaskBuilder(BaseMaskBuilder):
+    """Mask by role via message-level tokenisation with role-span tracking.
+
+    For each message, renders the chat template for that single message,
+    encodes individually, and records its token span + role action.
+    The concatenated sequence receives a loss_mask built from span rules.
+    """
+
+    def build(self, item: dict, config, tokenizer) -> Optional[dict]:
+        messages = item.get(config.input.messages_key)
+        if not isinstance(messages, list) or not messages:
+            return None
+
+        all_ids: List[int] = []
+        spans: List[tuple] = []
+
+        if tokenizer.bos_token_id is not None:
+            all_ids.append(tokenizer.bos_token_id)
+
+        for msg in messages:
+            role = msg.get("role", "")
+            action = config.mask.get(role, config.mask_default)
+
+            rendered = tokenizer.apply_chat_template(
+                [msg], tokenize=False, add_generation_prompt=False
+            )
+            ids = tokenizer.encode(rendered, add_special_tokens=False)
+
+            start = len(all_ids)
+            all_ids.extend(ids)
+            spans.append((start, len(all_ids), action))
+
+        if len(all_ids) <= 1:
+            return None
+
+        max_len = config.preprocessing.max_seq_len
+        all_ids = all_ids[:max_len]
+
+        loss_mask = [0] * len(all_ids)
+        for start, end, action in spans:
+            if start >= len(all_ids):
+                break
+            e = min(end, len(all_ids))
+            if action == "train":
+                loss_mask[start:e] = [1] * (e - start)
+
+        return {
+            "ids": all_ids,
+            "loss_mask": loss_mask,
+            "domain": _extract_domain(item, config.output.domain_key),
+        }
+
+
+@MaskBuilderFactory.register("instruction")
+class InstructionMaskBuilder(BaseMaskBuilder):
+    """Mask by prompt / response field boundary.
+
+    Encodes prompt and response independently, then fills mask
+    according to ``prompt`` / ``response`` entries in the mask config.
+    """
+
+    def build(self, item: dict, config, tokenizer) -> Optional[dict]:
+        prompt = str(item.get(config.input.prompt_key, ""))
+        response = str(item.get(config.input.response_key, ""))
+
+        if not prompt.strip() and not response.strip():
+            return None
+
+        prompt_ids = tokenizer.encode(prompt, add_special_tokens=True)
+        response_ids = tokenizer.encode(response, add_special_tokens=False)
+
+        max_len = config.preprocessing.max_seq_len
+        full_ids = (prompt_ids + response_ids)[:max_len]
+
+        prompt_action = config.mask.get("prompt", config.mask_default)
+        response_action = config.mask.get("response", config.mask_default)
+
+        p_len = min(len(prompt_ids), len(full_ids))
+        r_len = len(full_ids) - p_len
+
+        loss_mask = []
+        if prompt_action == "train":
+            loss_mask += [1] * p_len
+        else:
+            loss_mask += [0] * p_len
+
+        if response_action == "train":
+            loss_mask += [1] * r_len
+        else:
+            loss_mask += [0] * r_len
+
+        return {
+            "ids": full_ids,
+            "loss_mask": loss_mask,
+            "domain": _extract_domain(item, config.output.domain_key),
+        }
+
+
+@MaskBuilderFactory.register("text")
+class TextMaskBuilder(BaseMaskBuilder):
+    """Plain tokenisation — no mask, used for pre-training data."""
+
+    def build(self, item: dict, config, tokenizer) -> Optional[dict]:
+        text = item.get(config.input.text_key, "")
+        if not isinstance(text, str) or not text.strip():
+            return None
+
+        pp = config.preprocessing
+        if not (pp.min_chars <= len(text) <= pp.max_chars):
+            return None
+
+        ids = tokenizer.encode(text, add_special_tokens=True)
+        ids = ids[: pp.max_seq_len]
+
+        return {
+            "ids": ids,
+            "domain": _extract_domain(item, config.output.domain_key),
+        }
--- a/astrai/preprocessing/pipeline.py
+++ b/astrai/preprocessing/pipeline.py
@ -0,0 +1,134 @@
+"""Config-driven JSONL preprocessing pipeline.
+
+Composes a :class:`BaseMaskBuilder` (selected by ``input.type``) with
+deduplication, sharding, and flush to ``.h5`` / ``.bin`` storage.
+"""
+
+from __future__ import annotations
+
+import hashlib
+import json
+import os
+from collections import defaultdict
+from typing import List, Optional
+
+import torch
+import tqdm
+
+from astrai.config.preprocess_config import PipelineConfig
+from astrai.dataset.storage import save_bin, save_h5
+from astrai.preprocessing.builder import MaskBuilderFactory
+from astrai.tokenize import AutoTokenizer
+
+
+def filter_by_length(text: str, min_len: int = 50, max_len: int = 2_000_000) -> bool:
+    return min_len <= len(text) <= max_len
+
+
+def dedup_signature(item: dict) -> str:
+    raw = json.dumps(item, sort_keys=True, ensure_ascii=False)
+    return hashlib.md5(raw[:200].encode()).hexdigest()
+
+
+class Pipeline:
+    """Tokenization pipeline driven by a declarative :class:`PipelineConfig`.
+
+    Usage::
+
+        config = PipelineConfig.from_json("sft_pipeline.json")
+        Pipeline(config, ["data.jsonl"], output_dir="out", tokenizer_path="params").run()
+    """
+
+    def __init__(
+        self,
+        config: PipelineConfig,
+        input_paths: List[str],
+        output_dir: str,
+        tokenizer_path: str,
+    ):
+        os.makedirs(output_dir, exist_ok=True)
+        self.config = config
+        self.paths = input_paths
+        self.output_dir = output_dir
+        self.tokenizer_path = tokenizer_path
+
+        self.mask_builder = MaskBuilderFactory.create(config.input.type)
+
+    def transform(self, item: dict) -> Optional[dict]:
+        return self.mask_builder.build(item, self.config, self._tokenizer)
+
+    def run(self):
+        self._tokenizer = AutoTokenizer.from_pretrained(self.tokenizer_path)
+
+        seen: set = set()
+        domains: dict = defaultdict(lambda: defaultdict(list))
+        total_tokens = 0
+        shard_idx: dict[str, int] = defaultdict(int)
+        count = 0
+
+        pp = self.config.preprocessing
+
+        for item in tqdm.tqdm(
+            self._iter_items(), desc="Tokenizing", unit="docs", mininterval=0.5
+        ):
+            if pp.max_items and count >= pp.max_items:
+                break
+
+            if pp.deduplicate:
+                sig = dedup_signature(item)
+                if sig in seen:
+                    continue
+                seen.add(sig)
+
+            result = self.transform(item)
+            if result is None:
+                continue
+
+            ids = result["ids"]
+            if not ids:
+                continue
+
+            domain = result.get("domain", "__default__")
+            domains[domain]["sequence"].append(ids)
+            if "loss_mask" in result:
+                domains[domain]["loss_mask"].append(result["loss_mask"])
+
+            count += 1
+            total_tokens += len(ids)
+
+            if total_tokens >= self.config.output.max_tokens_per_shard:
+                self._flush(domains, shard_idx)
+                domains.clear()
+                total_tokens = 0
+
+        if total_tokens > 0:
+            self._flush(domains, shard_idx)
+
+        print(f"Done. {count} documents tokenized.")
+
+    def _iter_items(self):
+        for path in self.paths:
+            with open(path, "r", encoding="utf-8") as f:
+                for line in f:
+                    line = line.strip()
+                    if not line:
+                        continue
+                    yield json.loads(line)
+
+    def _flush(self, domains, shard_idx):
+        for domain, keys in domains.items():
+            idx = shard_idx[domain]
+            tensors = {}
+            for key, ids_list in keys.items():
+                tensors[key] = [torch.tensor(sum(ids_list, []), dtype=torch.long)]
+            chunk_dir = os.path.join(self.output_dir, domain)
+            fmt = self.config.output.storage_format
+            if fmt == "bin":
+                save_bin(chunk_dir, tensors)
+            else:
+                save_h5(chunk_dir, f"data_{idx:04d}", tensors)
+            shard_idx[domain] = idx + 1
+            tqdm.tqdm.write(
+                f"  saved {domain}/shard_{idx:04d}  "
+                f"({tensors['sequence'][0].numel():,} tokens)"
+            )
--- a/astrai/trainer/strategy.py
+++ b/astrai/trainer/strategy.py
@ -1,6 +1,5 @@
 """Training strategy implementations with factory pattern."""

-import copy
 from abc import ABC, abstractmethod
 from typing import Any, Callable, Dict, Union

@ -8,28 +7,14 @@ import torch
 import torch.nn as nn
 import torch.nn.functional as F
 from torch import Tensor
-from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
-from torch.nn.parallel import DistributedDataParallel as DDP

 from astrai.factory import BaseFactory


-def unwrap_model(model: nn.Module) -> nn.Module:
-    if isinstance(model, DDP):
-        return model.module
-    if isinstance(model, FSDP):
-        return model._fsdp_wrapped_module
-    return model
-
-
-def create_ref_model(model: nn.Module) -> nn.Module:
-    """Create a reference model for DPO/GRPO training.
-
-    Handles DDP-wrapped models safely by unwrapping first,
-    then creating a deep copy with frozen gradients.
-    """
-    original_model = unwrap_model(model)
-    ref_model = copy.deepcopy(original_model)
+def create_ref_model(model_fn, state_dict: dict) -> nn.Module:
+    """Create a frozen reference model from model_fn + full state dict."""
+    ref_model = model_fn()
+    ref_model.load_state_dict(state_dict)
    ref_model.requires_grad_(False)
    ref_model.eval()
    return ref_model
@ -91,6 +76,8 @@ class BaseStrategy(ABC):
    ):
        self.model = model
        self.device = device
+        self.executor = kwargs.pop("executor", None)
+        self.model_fn = kwargs.pop("model_fn", None)
        self.extra_kwargs = kwargs

    @abstractmethod
@ -230,7 +217,9 @@ class DPOStrategy(BaseStrategy):
        **kwargs,
    ):
        super().__init__(model, device, **kwargs)
-        self.ref_model = create_ref_model(model)
+        self.ref_model = create_ref_model(
+            self.model_fn, self.executor.unwrap_model(model)
+        ).to(device=self.device)
        self.beta = beta
        self.reduction = reduction

@ -284,7 +273,9 @@ class GRPOStrategy(BaseStrategy):
        **kwargs,
    ):
        super().__init__(model, device, **kwargs)
-        self.ref_model = create_ref_model(model)
+        self.ref_model = create_ref_model(
+            self.model_fn, self.executor.unwrap_model(model)
+        ).to(device=self.device)
        self.clip_eps = clip_eps
        self.kl_coef = kl_coef
        self.group_size = group_size
@ -294,8 +285,7 @@ class GRPOStrategy(BaseStrategy):

    def sync_ref_model(self):
        """Copy current model weights to ref model."""
-        ref_state = self.model.state_dict()
-        self.ref_model.load_state_dict(ref_state)
+        self.ref_model.load_state_dict(self.executor.unwrap_model(self.model))

    def compute_loss(self, batch: Dict[str, Tensor]) -> Tensor:
        self._step += 1
--- a/astrai/trainer/train_callback.py
+++ b/astrai/trainer/train_callback.py
@ -146,8 +146,7 @@ class CheckpointCallback(TrainCallback):
        self.last_ckpt_iter = 0

    def _save_checkpoint(self, context: TrainContext):
-        unwrapped = context.executor.unwrap_model(context.model)
-        state_dict = unwrapped.state_dict()
+        state_dict = context.executor.unwrap_model(context.model)
        self.last_ckpt_iter = context.iteration

        if get_rank() == 0:
--- a/astrai/trainer/train_context.py
+++ b/astrai/trainer/train_context.py
@ -162,6 +162,8 @@ class TrainContextBuilder:
            model=context.model,
            train_type=cfg.strategy,
            device=device,
+            executor=executor,
+            model_fn=cfg.model_fn,
            **cfg.extra_kwargs,
        )

--- a/scripts/tools/evaluate_mmlu.py
+++ b/scripts/tools/evaluate_mmlu.py
@ -5,9 +5,9 @@ import csv
 import json
 import os
 import shutil
-import urllib.request
-import zipfile
+import tarfile

+import requests
 import torch
 import torch.nn.functional as F
 import tqdm
@ -15,7 +15,7 @@ import tqdm
 from astrai.model import AutoModel
 from astrai.tokenize import AutoTokenizer

-MMLU_URL = "https://github.com/hendrycks/test/archive/refs/heads/master.zip"
+MMLU_URL = "https://people.eecs.berkeley.edu/~hendrycks/data.tar"
 MMLU_SUBJECTS = [
    "abstract_algebra",
    "anatomy",
@ -78,23 +78,37 @@ MMLU_SUBJECTS = [


 def _download_and_extract(url: str, data_dir: str):
-    zip_path = os.path.join(data_dir, "mmlu.zip")
+    tar_path = os.path.join(data_dir, "data.tar")
    os.makedirs(data_dir, exist_ok=True)
    print(f"Downloading MMLU data from {url}...")
-    urllib.request.urlretrieve(url, zip_path)
+    resp = requests.get(url, stream=True, timeout=300)
+    resp.raise_for_status()
+    total = int(resp.headers.get("content-length", 0))
+    with tqdm.tqdm(total=total, unit="B", unit_scale=True, desc="  Download") as bar:
+        with open(tar_path, "wb") as f:
+            for chunk in resp.iter_content(chunk_size=8192):
+                f.write(chunk)
+                bar.update(len(chunk))
    print("Extracting...")
-    with zipfile.ZipFile(zip_path, "r") as zf:
-        zf.extractall(data_dir)
-    os.remove(zip_path)
+    with tarfile.open(tar_path, "r") as tf:
+        tf.extractall(data_dir)
+    os.remove(tar_path)


 def download_mmlu(data_dir: str):
    _download_and_extract(MMLU_URL, data_dir)
-    src = os.path.join(data_dir, "test-master", "data")
+    src = os.path.join(data_dir, "data")
    if os.path.exists(src):
        for item in os.listdir(src):
-            os.rename(os.path.join(src, item), os.path.join(data_dir, item))
-        shutil.rmtree(os.path.join(data_dir, "test-master"))
+            src_item = os.path.join(src, item)
+            dst_item = os.path.join(data_dir, item)
+            if os.path.exists(dst_item):
+                if os.path.isdir(dst_item):
+                    shutil.rmtree(dst_item)
+                else:
+                    os.remove(dst_item)
+            os.rename(src_item, dst_item)
+        os.rmdir(src)
    print(f"MMLU data saved to {data_dir}")


@ -233,6 +247,7 @@ def main():
    device = args.device
    dtype = getattr(torch, args.dtype)
    model.to(device=device, dtype=dtype)
+    model.eval()

    subjects = args.subjects or MMLU_SUBJECTS
    results = {}
--- a/scripts/tools/preprocess.py
+++ b/scripts/tools/preprocess.py
@ -0,0 +1,38 @@
+"""CLI: JSONL → tokenized .h5/.bin via config-driven Pipeline."""
+
+import argparse
+
+from astrai.config.preprocess_config import PipelineConfig
+from astrai.preprocessing.pipeline import Pipeline
+
+
+def main():
+    parser = argparse.ArgumentParser(
+        description="Raw JSONL → tokenized .h5/.bin via config-driven Pipeline"
+    )
+    parser.add_argument(
+        "inputs", nargs="+", metavar="JSONL", help="One or more JSONL files"
+    )
+    parser.add_argument("--output_dir", "-o", required=True, help="Output directory")
+    parser.add_argument(
+        "--config", "-c", required=True, help="Path to pipeline config JSON"
+    )
+    parser.add_argument(
+        "--tokenizer_path",
+        default="params",
+        help="Path to tokenizer directory (default: params)",
+    )
+    args = parser.parse_args()
+
+    config = PipelineConfig.from_json(args.config)
+
+    Pipeline(
+        config=config,
+        input_paths=args.inputs,
+        output_dir=args.output_dir,
+        tokenizer_path=args.tokenizer_path,
+    ).run()
+
+
+if __name__ == "__main__":
+    main()
--- a/tests/data/test_dataset.py
+++ b/tests/data/test_dataset.py
@ -1,4 +1,3 @@
-import json
 import os

 import numpy as np
@ -8,7 +7,6 @@ import torch
 from astrai.dataset.dataset import DatasetFactory, SEQDataset
 from astrai.dataset.storage import (
    H5Store,
-    MmapStore,
    StoreFactory,
    detect_format,
    load_bin,
--- a/tests/data/test_preprocess.py
+++ b/tests/data/test_preprocess.py
@ -0,0 +1,603 @@
+import json
+import os
+import tempfile
+
+import pytest
+from tokenizers import Tokenizer, models, pre_tokenizers, trainers
+
+from astrai.config.preprocess_config import (
+    InputConfig,
+    OutputConfig,
+    PipelineConfig,
+    ProcessingConfig,
+)
+from astrai.preprocessing.builder import (
+    ChatMaskBuilder,
+    InstructionMaskBuilder,
+    MaskBuilderFactory,
+    TextMaskBuilder,
+)
+from astrai.preprocessing.pipeline import Pipeline, dedup_signature, filter_by_length
+from astrai.tokenize import AutoTokenizer
+
+_SPECIAL_TOKENS = [
+    "<unk>",
+    "<pad>",
+    "<|begin_of_sentence|>",
+    "<|end_of_sentence|>",
+    "<|im_start|>",
+    "<|im_end|>",
+]
+
+_CHAT_TEMPLATE = (
+    "{% for message in messages %}"
+    "{% if message['role'] == 'system' %}"
+    "<|im_start|>system\n{{ message['content'] }}<|im_end|>\n"
+    "{% elif message['role'] == 'user' %}"
+    "<|im_start|>user\n{{ message['content'] }}<|im_end|>\n"
+    "{% elif message['role'] == 'assistant' %}"
+    "<|im_start|>assistant\n{{ message['content'] }}<|im_end|>\n"
+    "{% endif %}"
+    "{% endfor %}"
+    "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
+)
+
+
+def _build_chat_tokenizer() -> AutoTokenizer:
+    tok = Tokenizer(models.BPE())
+    tok.pre_tokenizer = pre_tokenizers.ByteLevel(add_prefix_space=False)
+    tr = trainers.BpeTrainer(
+        vocab_size=512,
+        min_frequency=1,
+        special_tokens=_SPECIAL_TOKENS,
+    )
+    train_data = [
+        "hello world",
+        "Hi there!",
+        "You are helpful.",
+        "What is 2+2?",
+        "Tell me a story about dragons and knights.",
+        "Sure, here is a tale.",
+        "Translate to French: Hello",
+        "Bonjour",
+        "Artificial Intelligence is a field of computer science.",
+        "system",
+        "user",
+        "assistant",
+        "<|im_start|>",
+        "<|im_end|>",
+        *[chr(i) for i in range(32, 127)],
+    ]
+    tok.train_from_iterator(train_data, tr)
+
+    auto_tok = AutoTokenizer()
+    auto_tok._tokenizer = tok
+    auto_tok._special_token_map = {
+        "bos_token": "<|begin_of_sentence|>",
+        "eos_token": "<|end_of_sentence|>",
+        "pad_token": "<pad>",
+        "unk_token": "<unk>",
+    }
+    auto_tok.set_chat_template(_CHAT_TEMPLATE)
+    return auto_tok
+
+
+@pytest.fixture(scope="session")
+def chat_tokenizer():
+    return _build_chat_tokenizer()
+
+
+@pytest.fixture
+def temp_dir():
+    d = tempfile.mkdtemp()
+    yield d
+    import shutil
+
+    shutil.rmtree(d, ignore_errors=True)
+
+
+def make_chat_config():
+    return PipelineConfig(
+        input=InputConfig(type="chat", messages_key="messages"),
+        mask={"system": "mask", "user": "mask", "assistant": "train"},
+        mask_default="mask",
+        preprocessing=ProcessingConfig(max_seq_len=2048),
+    )
+
+
+def make_instruction_config():
+    return PipelineConfig(
+        input=InputConfig(
+            type="instruction", prompt_key="prompt", response_key="response"
+        ),
+        mask={"prompt": "mask", "response": "train"},
+        mask_default="mask",
+        preprocessing=ProcessingConfig(max_seq_len=2048),
+    )
+
+
+def make_text_config():
+    return PipelineConfig(
+        input=InputConfig(type="text", text_key="text"),
+        preprocessing=ProcessingConfig(
+            max_seq_len=2048, min_chars=1, max_chars=2_000_000
+        ),
+    )
+
+
+class TestPipelineConfig:
+    def test_default_values(self):
+        config = PipelineConfig()
+        assert config.version == 1
+        assert config.input.type == "chat"
+        assert config.mask == {}
+        assert config.mask_default == "mask"
+        assert config.preprocessing.max_seq_len == 2048
+        assert config.output.storage_format == "bin"
+
+    def test_from_dict_flat(self):
+        data = {
+            "version": 1,
+            "input": {"type": "chat", "messages_key": "msgs"},
+            "mask": {"system": "mask", "assistant": "train"},
+            "mask_default": "mask",
+            "preprocessing": {"max_seq_len": 1024},
+            "output": {"storage_format": "h5"},
+        }
+        config = PipelineConfig.from_dict(data)
+        assert config.input.type == "chat"
+        assert config.input.messages_key == "msgs"
+        assert config.mask == {"system": "mask", "assistant": "train"}
+        assert config.preprocessing.max_seq_len == 1024
+        assert config.output.storage_format == "h5"
+
+    def test_to_dict_roundtrip(self):
+        config = PipelineConfig(
+            input=InputConfig(type="instruction", prompt_key="q", response_key="a"),
+            mask={"prompt": "mask", "response": "train"},
+            mask_default="mask",
+        )
+        d = config.to_dict()
+        config2 = PipelineConfig.from_dict(d)
+        assert config2.input.type == "instruction"
+        assert config2.input.prompt_key == "q"
+        assert config2.mask == {"prompt": "mask", "response": "train"}
+
+    def test_to_json_from_json(self, temp_dir):
+        config = PipelineConfig(
+            input=InputConfig(type="text", text_key="body"),
+            mask={"text": "train"},
+            mask_default="mask",
+        )
+        path = os.path.join(temp_dir, "config.json")
+        config.to_json(path)
+        loaded = PipelineConfig.from_json(path)
+        assert loaded.input.type == "text"
+        assert loaded.input.text_key == "body"
+        assert loaded.mask == {"text": "train"}
+
+
+class TestChatMaskBuilder:
+    def test_simple_chat_mask(self, chat_tokenizer):
+        config = make_chat_config()
+        builder = ChatMaskBuilder()
+        item = {
+            "messages": [
+                {"role": "system", "content": "You are helpful."},
+                {"role": "user", "content": "Hello."},
+                {"role": "assistant", "content": "Hi there!"},
+            ]
+        }
+        result = builder.build(item, config, chat_tokenizer)
+        assert result is not None
+        assert "ids" in result
+        assert "loss_mask" in result
+        assert len(result["ids"]) == len(result["loss_mask"])
+
+        ids = chat_tokenizer.decode(result["ids"], skip_special_tokens=False)
+
+        assert "system" in ids.lower() or "<|im_start|>system" in ids
+        assert "assistant" in ids.lower() or "<|im_start|>assistant" in ids
+
+        total = len(result["ids"])
+        trained = sum(result["loss_mask"])
+        assert trained > 0, "At least assistant tokens should be trained"
+        assert trained < total, "System and user tokens should be masked"
+
+    def test_mask_only_assistant_trained(self, chat_tokenizer):
+        config = make_chat_config()
+        builder = ChatMaskBuilder()
+        item = {
+            "messages": [
+                {"role": "user", "content": "What is 2+2?"},
+                {"role": "assistant", "content": "4"},
+            ]
+        }
+        result = builder.build(item, config, chat_tokenizer)
+        mask = result["loss_mask"]
+        ids = result["ids"]
+
+        assert len(ids) == len(mask)
+
+        trained_positions = [i for i, m in enumerate(mask) if m == 1]
+        assert len(trained_positions) > 0, "At least some tokens should be trained"
+
+        masked_positions = [i for i, m in enumerate(mask) if m == 0]
+        assert len(masked_positions) > 0, "User tokens should be masked"
+
+    def test_chat_all_masked(self, chat_tokenizer):
+        config = PipelineConfig(
+            input=InputConfig(type="chat", messages_key="messages"),
+            mask={"system": "mask", "user": "mask", "assistant": "mask"},
+            mask_default="mask",
+            preprocessing=ProcessingConfig(max_seq_len=2048),
+        )
+        builder = ChatMaskBuilder()
+        item = {
+            "messages": [
+                {"role": "system", "content": "You are helpful."},
+                {"role": "assistant", "content": "Hi there!"},
+            ]
+        }
+        result = builder.build(item, config, chat_tokenizer)
+        assert sum(result["loss_mask"]) == 0
+
+    def test_chat_all_trained(self, chat_tokenizer):
+        config = PipelineConfig(
+            input=InputConfig(type="chat", messages_key="messages"),
+            mask={},
+            mask_default="train",
+            preprocessing=ProcessingConfig(max_seq_len=2048),
+        )
+        builder = ChatMaskBuilder()
+        item = {
+            "messages": [
+                {"role": "system", "content": "You are helpful."},
+                {"role": "assistant", "content": "Hi there!"},
+            ]
+        }
+        result = builder.build(item, config, chat_tokenizer)
+        assert sum(result["loss_mask"]) == len(result["ids"]) - 1
+
+    def test_empty_messages_returns_none(self, chat_tokenizer):
+        config = make_chat_config()
+        builder = ChatMaskBuilder()
+        assert builder.build({"messages": []}, config, chat_tokenizer) is None
+        assert builder.build({}, config, chat_tokenizer) is None
+
+    def test_domain_extraction(self, chat_tokenizer):
+        config = PipelineConfig(
+            input=InputConfig(type="chat", messages_key="messages"),
+            mask={"assistant": "train"},
+            mask_default="mask",
+            preprocessing=ProcessingConfig(max_seq_len=2048),
+            output=OutputConfig(domain_key="source"),
+        )
+        builder = ChatMaskBuilder()
+        item = {
+            "messages": [
+                {"role": "user", "content": "Hi"},
+                {"role": "assistant", "content": "Hello"},
+            ],
+            "source": "wiki",
+        }
+        result = builder.build(item, config, chat_tokenizer)
+        assert result["domain"] == "wiki"
+
+    def test_truncation_to_max_len(self, chat_tokenizer):
+        config = PipelineConfig(
+            input=InputConfig(type="chat", messages_key="messages"),
+            mask={"assistant": "train"},
+            mask_default="mask",
+            preprocessing=ProcessingConfig(max_seq_len=10),
+        )
+        builder = ChatMaskBuilder()
+        item = {
+            "messages": [
+                {
+                    "role": "user",
+                    "content": "Tell me a very long story about dragons and knights and magic.",
+                },
+                {"role": "assistant", "content": "Sure! Here is a tale..."},
+            ]
+        }
+        result = builder.build(item, config, chat_tokenizer)
+        assert len(result["ids"]) <= 10
+        assert len(result["loss_mask"]) == len(result["ids"])
+
+
+class TestInstructionMaskBuilder:
+    def test_basic_instruction_mask(self, test_tokenizer):
+        config = make_instruction_config()
+        builder = InstructionMaskBuilder()
+        item = {"prompt": "Translate to French: Hello", "response": "Bonjour"}
+        result = builder.build(item, config, test_tokenizer)
+        assert result is not None
+        assert len(result["ids"]) == len(result["loss_mask"])
+
+    def test_prompt_masked_response_trained(self, test_tokenizer):
+        config = make_instruction_config()
+        builder = InstructionMaskBuilder()
+        item = {"prompt": "hello", "response": "world"}
+        result = builder.build(item, config, test_tokenizer)
+        mask = result["loss_mask"]
+        ids = result["ids"]
+
+        prompt_ids = test_tokenizer.encode("hello", add_special_tokens=True)
+        response_ids = test_tokenizer.encode("world", add_special_tokens=False)
+
+        p_len = min(len(prompt_ids), len(ids))
+        assert all(m == 0 for m in mask[:p_len])
+
+        if p_len < len(ids):
+            assert all(m == 1 for m in mask[p_len:])
+
+    def test_train_on_prompt(self, test_tokenizer):
+        config = PipelineConfig(
+            input=InputConfig(
+                type="instruction", prompt_key="prompt", response_key="response"
+            ),
+            mask={"prompt": "train", "response": "mask"},
+            mask_default="mask",
+            preprocessing=ProcessingConfig(max_seq_len=2048),
+        )
+        builder = InstructionMaskBuilder()
+        item = {"prompt": "hello", "response": "world"}
+        result = builder.build(item, config, test_tokenizer)
+        mask = result["loss_mask"]
+        ids = result["ids"]
+
+        prompt_ids = test_tokenizer.encode("hello", add_special_tokens=True)
+        p_len = min(len(prompt_ids), len(ids))
+        assert all(m == 1 for m in mask[:p_len])
+
+
+class TestTextMaskBuilder:
+    def test_basic_text(self, test_tokenizer):
+        config = make_text_config()
+        builder = TextMaskBuilder()
+        item = {"text": "Hello world. This is a test document."}
+        result = builder.build(item, config, test_tokenizer)
+        assert result is not None
+        assert "ids" in result
+        assert len(result["ids"]) > 0
+        assert "loss_mask" not in result
+
+    def test_empty_text_returns_none(self, test_tokenizer):
+        config = make_text_config()
+        builder = TextMaskBuilder()
+        assert builder.build({"text": ""}, config, test_tokenizer) is None
+        assert builder.build({"text": "   "}, config, test_tokenizer) is None
+
+    def test_too_short_text(self, test_tokenizer):
+        config = PipelineConfig(
+            input=InputConfig(type="text", text_key="text"),
+            preprocessing=ProcessingConfig(min_chars=100),
+        )
+        builder = TextMaskBuilder()
+        assert builder.build({"text": "short"}, config, test_tokenizer) is None
+
+    def test_truncation(self, test_tokenizer):
+        config = PipelineConfig(
+            input=InputConfig(type="text", text_key="text"),
+            preprocessing=ProcessingConfig(max_seq_len=3, min_chars=1),
+        )
+        builder = TextMaskBuilder()
+        item = {"text": "This is a very long text that should be truncated"}
+        result = builder.build(item, config, test_tokenizer)
+        assert len(result["ids"]) <= 3
+
+
+class TestPipeline:
+    def test_full_chat_pipeline(self, temp_dir, chat_tokenizer):
+        tokenizer_dir = os.path.join(temp_dir, "tok")
+        os.makedirs(tokenizer_dir, exist_ok=True)
+        chat_tokenizer._tokenizer.save(os.path.join(tokenizer_dir, "tokenizer.json"))
+        with open(os.path.join(tokenizer_dir, "tokenizer_config.json"), "w") as f:
+            json.dump(
+                {
+                    "special_tokens": {
+                        "bos_token": "<|begin_of_sentence|>",
+                        "eos_token": "<|end_of_sentence|>",
+                        "pad_token": "<pad>",
+                        "unk_token": "<unk>",
+                        "im_start": "<|im_start|>",
+                        "im_end": "<|im_end|>",
+                    },
+                    "chat_template": _CHAT_TEMPLATE,
+                },
+                f,
+            )
+
+        jsonl_path = os.path.join(temp_dir, "chat.jsonl")
+        with open(jsonl_path, "w", encoding="utf-8") as f:
+            f.write(
+                json.dumps(
+                    {
+                        "messages": [
+                            {"role": "system", "content": "You are helpful."},
+                            {"role": "user", "content": "Hi."},
+                            {"role": "assistant", "content": "Hello!"},
+                        ]
+                    }
+                )
+                + "\n"
+            )
+            f.write(
+                json.dumps(
+                    {
+                        "messages": [
+                            {"role": "user", "content": "What is 2+2?"},
+                            {"role": "assistant", "content": "4"},
+                        ]
+                    }
+                )
+                + "\n"
+            )
+
+        config = PipelineConfig(
+            input=InputConfig(type="chat", messages_key="messages"),
+            mask={"system": "mask", "user": "mask", "assistant": "train"},
+            mask_default="mask",
+            preprocessing=ProcessingConfig(max_seq_len=2048, deduplicate=True),
+            output=OutputConfig(storage_format="bin", domain_key=None),
+        )
+
+        out_dir = os.path.join(temp_dir, "output")
+        Pipeline(
+            config=config,
+            input_paths=[jsonl_path],
+            output_dir=out_dir,
+            tokenizer_path=tokenizer_dir,
+        ).run()
+
+        meta_path = os.path.join(out_dir, "__default__", "meta.json")
+        assert os.path.exists(meta_path)
+        with open(meta_path, "r") as f:
+            meta = json.load(f)
+        assert "sequence" in meta
+        assert "loss_mask" in meta
+
+    def test_full_text_pipeline(self, temp_dir, test_tokenizer):
+        import tempfile as tmp
+
+        tokenizer_dir = os.path.join(temp_dir, "tok")
+        os.makedirs(tokenizer_dir, exist_ok=True)
+
+        test_tokenizer._tokenizer.save(os.path.join(tokenizer_dir, "tokenizer.json"))
+        with open(os.path.join(tokenizer_dir, "tokenizer_config.json"), "w") as f:
+            json.dump(
+                {"special_tokens": {"pad_token": "<pad>", "unk_token": "<unk>"}}, f
+            )
+
+        jsonl_path = os.path.join(temp_dir, "text.jsonl")
+        with open(jsonl_path, "w", encoding="utf-8") as f:
+            f.write(
+                json.dumps(
+                    {
+                        "text": "Hello world this is a test document with enough characters to pass the minimum length filter."
+                    }
+                )
+                + "\n"
+            )
+            f.write(
+                json.dumps(
+                    {
+                        "text": "Another document for testing purposes with sufficient length to be processed."
+                    }
+                )
+                + "\n"
+            )
+
+        config = PipelineConfig(
+            input=InputConfig(type="text", text_key="text"),
+            preprocessing=ProcessingConfig(
+                max_seq_len=2048, min_chars=10, deduplicate=True
+            ),
+            output=OutputConfig(storage_format="bin"),
+        )
+
+        out_dir = os.path.join(temp_dir, "output")
+        Pipeline(
+            config=config,
+            input_paths=[jsonl_path],
+            output_dir=out_dir,
+            tokenizer_path=tokenizer_dir,
+        ).run()
+
+        meta_path = os.path.join(out_dir, "__default__", "meta.json")
+        assert os.path.exists(meta_path)
+        with open(meta_path, "r") as f:
+            meta = json.load(f)
+        assert "sequence" in meta
+        assert "loss_mask" not in meta
+
+    def test_full_instruction_pipeline(self, temp_dir, test_tokenizer):
+        tokenizer_dir = os.path.join(temp_dir, "tok")
+        os.makedirs(tokenizer_dir, exist_ok=True)
+        test_tokenizer._tokenizer.save(os.path.join(tokenizer_dir, "tokenizer.json"))
+        with open(os.path.join(tokenizer_dir, "tokenizer_config.json"), "w") as f:
+            json.dump(
+                {"special_tokens": {"pad_token": "<pad>", "unk_token": "<unk>"}}, f
+            )
+
+        jsonl_path = os.path.join(temp_dir, "instruct.jsonl")
+        with open(jsonl_path, "w", encoding="utf-8") as f:
+            f.write(
+                json.dumps(
+                    {
+                        "prompt": "Tell me a joke",
+                        "response": "Why did the chicken cross the road?",
+                    }
+                )
+                + "\n"
+            )
+            f.write(
+                json.dumps(
+                    {
+                        "prompt": "What is AI?",
+                        "response": "Artificial Intelligence is a field of computer science.",
+                    }
+                )
+                + "\n"
+            )
+
+        config = PipelineConfig(
+            input=InputConfig(
+                type="instruction", prompt_key="prompt", response_key="response"
+            ),
+            mask={"prompt": "mask", "response": "train"},
+            mask_default="mask",
+            preprocessing=ProcessingConfig(max_seq_len=2048),
+            output=OutputConfig(storage_format="bin"),
+        )
+
+        out_dir = os.path.join(temp_dir, "output")
+        Pipeline(
+            config=config,
+            input_paths=[jsonl_path],
+            output_dir=out_dir,
+            tokenizer_path=tokenizer_dir,
+        ).run()
+
+        meta_path = os.path.join(out_dir, "__default__", "meta.json")
+        assert os.path.exists(meta_path)
+        with open(meta_path, "r") as f:
+            meta = json.load(f)
+        assert "sequence" in meta
+        assert "loss_mask" in meta
+
+
+class TestUtility:
+    def test_filter_by_length(self):
+        assert filter_by_length("hello world", min_len=5)
+        assert not filter_by_length("hi", min_len=5)
+        assert not filter_by_length("x" * 100, max_len=50)
+        assert filter_by_length("just right", min_len=5, max_len=20)
+
+    def test_dedup_signature(self):
+        a = {"key": "value", "number": 1}
+        b = {"number": 1, "key": "value"}
+        assert dedup_signature(a) == dedup_signature(b)
+        c = {"key": "different"}
+        assert dedup_signature(a) != dedup_signature(c)
+
+
+class TestFactoryRegistration:
+    def test_registered_builders(self):
+        names = MaskBuilderFactory._registry.list_names()
+        assert "chat" in names
+        assert "instruction" in names
+        assert "text" in names
+
+    def test_create_chat_builder(self):
+        builder = MaskBuilderFactory.create("chat")
+        assert isinstance(builder, ChatMaskBuilder)
+
+    def test_create_instruction_builder(self):
+        builder = MaskBuilderFactory.create("instruction")
+        assert isinstance(builder, InstructionMaskBuilder)
+
+    def test_create_text_builder(self):
+        builder = MaskBuilderFactory.create("text")
+        assert isinstance(builder, TextMaskBuilder)
Author	SHA1	Message	Date
ViperEkura	31ae2deeba	refactor : BaseConfig 提供 from_json/to_json，嵌套 config 自动反序列化 - from_json/to_json 上提至 BaseConfig，所有子类自动继承 - _coerce 新增 dict 到 BaseConfig 子类的递归反序列化，消除子类 from_dict 重载 - PipelineConfig 等子类仅声明字段，零样板代码 - 测试 tokenizer 改为自包含 BPE（含 chat template），不依赖 params/ 目录 - 特殊 token 改用 ASCII 字符，兼容所有平台	2026-05-30 21:04:19 +08:00
ViperEkura	69207e2c57	refactor : 基于声明式 JSON 配置的预处理管线重构 - 用工厂注册的 MaskBuilder（chat/instruction/text）替换硬编码的 _transform_* 方法 - mask 规则以 role-to-action 映射声明在配置中，与 chat_template 完全解耦 - 单次编码 + role-span 追踪替代两次编码 + 长度差计算 mask 的方式 - 支持多轮对话训练：所有 assistant 轮次参与训练，而非仅最后一轮 - 新建 astrai.preprocessing 包（builder.py + pipeline.py），删除 astrai/preprocess.py - CLI 精简为 --config 参数，所有参数通过 PipelineConfig JSON 配置 - 新增 PipelineConfig、InputConfig、ProcessingConfig、OutputConfig dataclass - 文档：assets/docs/preprocessing.md - 27 个测试覆盖 mask builder、pipeline、配置序列化、工厂注册	2026-05-30 20:45:09 +08:00
ViperEkura	138c5bcc08	feat : 添加 JSONL 预处理管线 - Pipeline 模板, Reader 加 transform 加 Writer 可组合 - 自动检测 JSONL 格式, 支持 messages 文本 prompt 加 response 三种 - chat 数据通过 apply_chat_template 适配, 自动生成 loss_mask - 输出对齐 Store 和 DatasetFactory, 直接用于训练 - 默认 bin 格式, CLI 入口 scripts/tools/preprocess.py	2026-05-30 17:12:42 +08:00
ViperEkura	a923e0a23a	fix : 修复 MMLU 评测脚本数据源和依赖 - 数据源改为 Berkeley data.tar（GitHub zip 不含数据文件） - urllib 替换为 requests，支持代理下载 - zip 解压替换为 tar，增加目录 flatten 逻辑 - 添加 model.eval() 确保推理模式正确	2026-05-30 16:51:24 +08:00
ViperEkura	f521a30b22	fix : FSDP 优化器顺序、温度除零、调度器静默死亡、ref模型设备 - executor: use_orig_params 硬编码 True，FSDP 不替换 Parameter 对象 - strategy: DPO/GRPO ref 模型创建后移到 device - sample: TemperatureStrategy clamp 1e-8，engine 验证改为 >0 - scheduler: 异常不 re-raise 避免 daemon 静默死亡，stop() 发回调给 waiting 任务	2026-05-29 21:57:44 +08:00
ViperEkura	d4451f6afb	fix : 并行训练 state_dict 收集与训练/推理并发缺陷 - FSDPExecutor: unwrap_model 返回全量 state_dict (state_dict_type FULL)；use_orig_params=True - DDPExecutor/BaseExecutor: unwrap_model 统一返回 model.module.state_dict() / model.state_dict() - CheckpointCallback: 走 executor.unwrap_model 拿完整 state_dict - strategy.py: 移除 FSDP/DDp 依赖；create_ref_model(model_fn, state_dict) 纯函数 - TrainContextBuilder: 传递 model_fn + executor 到 strategy - GRPOStrategy.sync_ref_model: 通过 executor.unwrap_model 获取完整权重 - TaskManager.wait_for_tasks: 锁内检查队列，消除 clear/set 竞态 - ProtocolHandler: stop token 不再计入 completion_tokens（流式/非流式）	2026-05-29 21:12:52 +08:00
ViperEkura	a3275423a4	release : v1.3.7 Features - FSDP parallel backend with zero-redundancy sharded training - LoRA fine-tuning module with low-rank adapter injection and persistence - NTK-Aware RoPE dynamic scaling, extending context window limit - MMLU evaluation script for standardized model knowledge assessment - load_json/load_safetensors broadcast mechanism for cross-node distributed loading Refactors - Storage layer refactored to Store pattern, removed Fetcher layer, supporting multi-segment data with explicit length - Training backend refactored to Executor pattern (none/ddp/fsdp), decoupling parallel logic - Inference protocol layer refactored to Strategy/Builder pattern with independent OpenAI/Anthropic responders - Unified serialization layer, eliminating scattered I/O paths - Removed JSONStore from data pipeline, unified to H5/Bin dual format - Simplified _disable_random_init, moved scheduler into sync block - Removed -> None return annotations, split FSDP parameters Fixes - Disabled DDP static_graph to prevent no_sync/backward conflict under PyTorch 2.7.1 - Checkpoint resume restores optimizer/scheduler state and sampler remaining length - Unwrap DDP/FSDP on checkpoint save to avoid module. prefix - start_epoch/start_batch determined by user args, no longer overridden by checkpoint - Left padding in perplexity.py causing incorrect PPL with batch>1 - Storage multi-segment bug, switched JSON to JSONL - Early abort on task_extend failure after decode, notify waiting tasks on scheduler crash Docs - Synced architecture/training/inference/dataflow/params docs to actual code Tests - Completed inference protocol layer unit test coverage - Added LoRA module tests - Filled storage layer test gaps	2026-05-29 17:46:03 +08:00