Parameter Documentation

Training Parameters

Basic Parameters

Parameter	Description	Default
`--train_type`	Training type (`seq`, `sft`, `dpo`, `grpo`)	required
`--data_root_path`	Dataset root directory	required
`--param_path`	Model parameters or checkpoint path	required
`--n_epoch`	Total training epochs	1
`--batch_per_device`	Batch size per device	1
`--grad_accum_steps`	Gradient accumulation steps between optimizer steps	1
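
The batch-related flags combine into an effective batch size per optimizer step. The sketch below assumes plain data parallelism, where each of the `--nprocs` processes handles its own micro-batch. `effective_batch_size` is a hypothetical helper for illustration, not a function in the training script.

```python
def effective_batch_size(batch_per_device: int, grad_accum_steps: int, nprocs: int = 1) -> int:
    """Samples contributing to one optimizer step, assuming data parallelism."""
    if min(batch_per_device, grad_accum_steps, nprocs) < 1:
        raise ValueError("all arguments must be >= 1")
    # Each process sees batch_per_device samples per micro-step and
    # accumulates gradients over grad_accum_steps micro-steps.
    return batch_per_device * grad_accum_steps * nprocs


# Values taken from the usage example: 4 per device, 8 accumulation steps, 4 processes.
print(effective_batch_size(4, 8, 4))  # 128
```

When the number of GPUs changes, scaling `--grad_accum_steps` inversely keeps this product, and so the optimization behaviour, roughly constant.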

Learning Rate Scheduling

Parameter	Description	Default
`--warmup_ratio`	Fraction of total steps used for LR warmup	0.05
`--max_lr`	Peak learning rate, reached at the end of warmup and then decayed on a cosine schedule	3e-4
`--max_grad_norm`	Maximum gradient norm for clipping	1.0
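
To make the interaction between `--warmup_ratio` and `--max_lr` concrete, here is a minimal sketch of a warmup-then-cosine schedule. The linear warmup shape and the `min_lr` floor of 0 are assumptions, since the table only states that cosine decay follows warmup. `lr_at_step` is a hypothetical name.

```python
import math


def lr_at_step(step: int, total_steps: int, max_lr: float = 3e-4,
               warmup_ratio: float = 0.05, min_lr: float = 0.0) -> float:
    """Learning rate at a 0-based step: linear warmup (assumed), then cosine decay."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        # Ramp linearly so the last warmup step reaches max_lr.
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Cosine curve from max_lr (progress 0) down to min_lr (progress 1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


total = 1000
for s in (0, 49, 525, 1000):
    print(s, lr_at_step(s, total))
```

With 1000 total steps and the default ratio of 0.05, the first 50 steps are warmup and the remaining 950 follow the cosine curve.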

Optimizer (AdamW)

Parameter	Description	Default
`--adamw_beta1`	AdamW beta1	0.9
`--adamw_beta2`	AdamW beta2	0.95
`--adamw_weight_decay`	AdamW weight decay	0.01

Data Loading

Parameter	Description	Default
`--window_size`	Max input sequence length	model config `max_len`
`--stride`	Stride for sliding window over sequences	None
`--random_seed`	Random seed for reproducibility	3407
`--num_workers`	DataLoader worker processes	4
`--no_pin_memory`	Disable pin_memory (enabled by default)	(flag)
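
The relationship between `--window_size` and `--stride` can be sketched as follows. This is an illustration under two assumptions not confirmed by the table: a stride of `None` falls back to `window_size` (non-overlapping windows), and trailing tokens that do not fill a whole window are dropped. `sliding_windows` is a hypothetical helper.

```python
from typing import List, Optional


def sliding_windows(tokens: List[int], window_size: int,
                    stride: Optional[int] = None) -> List[List[int]]:
    """Split a token sequence into fixed-length windows."""
    step = window_size if stride is None else stride  # assumed fallback
    if window_size < 1 or step < 1:
        raise ValueError("window_size and stride must be >= 1")
    if len(tokens) <= window_size:
        # Short sequences yield a single (possibly shorter) window.
        return [list(tokens)]
    return [tokens[start:start + window_size]
            for start in range(0, len(tokens) - window_size + 1, step)]


tokens = list(range(10))
print(sliding_windows(tokens, window_size=4, stride=2))
# [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
print(sliding_windows(tokens, window_size=4))
# [[0, 1, 2, 3], [4, 5, 6, 7]]
```

A stride smaller than the window produces overlapping samples, which increases the number of training examples drawn from each sequence.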

Checkpoint & Resume

Parameter	Description	Default
`--ckpt_interval`	Iterations between checkpoints	5000
`--ckpt_dir`	Checkpoint save directory	checkpoint
`--start_epoch`	Epoch to resume from (0 = start from scratch)	0
`--start_batch`	Batch iteration to resume from	0

Distributed Training

Parameter	Description	Default
`--nprocs`	Number of GPUs / processes	1
`--parallel_mode`	Parallel strategy (`none`, `ddp`, or `fsdp`)	none
`--device_type`	Device type	cuda
`--start_method`	Multiprocessing start method (`spawn`, `fork`, `forkserver`)	spawn

Strategy-specific

Parameter	Description	Default	Used by
`--dpo_beta`	DPO beta value	0.1	`dpo`
`--label_smoothing`	Label smoothing for cross-entropy loss	0.05	`seq`, `sft`
`--group_size`	GRPO group size	4	`grpo`
`--grpo_clip_eps`	GRPO clipping epsilon	0.2	`grpo`
`--grpo_kl_coef`	GRPO KL penalty coefficient	0.01	`grpo`
`--grpo_sync_interval`	GRPO ref_model sync interval (steps)	200	`grpo`
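
For readers unfamiliar with where `--dpo_beta` and `--group_size` enter the objectives, the following sketch shows the standard DPO loss for one preference pair and the usual group-relative advantage normalization from GRPO. It uses plain Python floats rather than tensors. The function names, the use of the population standard deviation, and the `eps` constant are assumptions, and the repository's implementation may differ.

```python
import math
import statistics
from typing import List


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss: -log(sigmoid(beta * margin)) for a single pair."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated in a numerically stable way.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one GRPO group (len(rewards) == group size)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std is an assumption
    return [(r - mean) / (std + eps) for r in rewards]


print(dpo_loss(0.0, 0.0, 0.0, 0.0))            # log(2) when policy equals reference
print(group_advantages([1.0, 0.0, 0.0, 1.0]))  # roughly [1, -1, -1, 1]
```

A larger `beta` penalizes deviation from the reference model more strongly. A larger group gives a less noisy baseline at the cost of more generations per prompt.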

Usage Example

export CUDA_VISIBLE_DEVICES=0,1,2,3

nohup python scripts/tools/train.py \
    --nprocs=4 \
    --train_type=seq \
    --data_root_path=/path/to/dataset \
    --param_path=/path/to/model \
    --batch_per_device=4 \
    --grad_accum_steps=8 \
    --warmup_ratio=0.05 \
    --max_lr=1e-4 \
    --max_grad_norm=1.0 \
    --adamw_beta1=0.9 \
    --adamw_beta2=0.95 \
    --adamw_weight_decay=0.01 \
    --window_size=2048 \
    --ckpt_interval=10000 \
    --ckpt_dir=./checkpoint \
    --random_seed=3407 \
    --label_smoothing=0.05 \
    > out.log 2> err.log &

Document Update Time: 2026-05-24
