Parameter Documentation
Training Parameters
Basic Parameters
| Parameter | Description | Default |
| --- | --- | --- |
| `--train_type` | Training type (seq, sft, dpo, grpo) | required |
| `--data_root_path` | Dataset root directory | required |
| `--param_path` | Model parameters or checkpoint path | required |
| `--n_epoch` | Total training epochs | 1 |
| `--batch_per_device` | Batch size per device | 1 |
| `--grad_accum_steps` | Gradient accumulation steps between optimizer steps | 1 |
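
The batch-related flags combine into an effective batch size per optimizer step. The sketch below assumes plain data parallelism, where each of the `--nprocs` processes handles its own `--batch_per_device` samples; the helper name `effective_batch_size` is hypothetical and not part of the training script.

```python
def effective_batch_size(batch_per_device: int, grad_accum_steps: int, nprocs: int = 1) -> int:
    """Samples contributing to one optimizer step.

    Assumption: every process sees a distinct mini-batch (data parallelism),
    and gradients are accumulated for grad_accum_steps micro-batches.
    """
    return batch_per_device * grad_accum_steps * nprocs


# Values taken from the usage example in this document:
# --batch_per_device=4, --grad_accum_steps=8, --nprocs=4
print(effective_batch_size(4, 8, 4))
```

If `--parallel_mode` is `none`, the `nprocs` factor may not apply, so treat the result as an upper bound in that case.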
Learning Rate Scheduling
| Parameter | Description | Default |
| --- | --- | --- |
| `--warmup_ratio` | Fraction of total steps used for LR warmup | 0.05 |
| `--max_lr` | Maximum learning rate (cosine decay after warmup) | 3e-4 |
| `--max_grad_norm` | Maximum gradient norm for clipping | 1.0 |
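
To make the interaction between `--warmup_ratio` and `--max_lr` concrete, here is a minimal sketch of a warmup-then-cosine schedule. It assumes linear warmup and decay down to a `min_lr` of 0; neither detail is stated in this document, and the function name `lr_at_step` is hypothetical.

```python
import math


def lr_at_step(step: int, total_steps: int, max_lr: float = 3e-4,
               warmup_ratio: float = 0.05, min_lr: float = 0.0) -> float:
    """Learning rate at a given optimizer step.

    Assumptions (not confirmed by the training script): warmup is linear,
    and the cosine phase decays from max_lr to min_lr.
    """
    warmup_steps = int(total_steps * warmup_ratio)
    if warmup_steps > 0 and step < warmup_steps:
        # Linear ramp that reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    decay_steps = max(1, total_steps - warmup_steps)
    progress = min(1.0, (step - warmup_steps) / decay_steps)
    # Half-cosine from max_lr (progress = 0) to min_lr (progress = 1).
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# With 1000 total steps and the default ratio, warmup covers the first 50 steps.
for s in (0, 49, 50, 525, 1000):
    print(s, lr_at_step(s, total_steps=1000))
```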
Optimizer (AdamW)
| Parameter | Description | Default |
| --- | --- | --- |
| `--adamw_beta1` | AdamW beta1 | 0.9 |
| `--adamw_beta2` | AdamW beta2 | 0.95 |
| `--adamw_weight_decay` | AdamW weight decay | 0.01 |
Data Loading
| Parameter | Description | Default |
| --- | --- | --- |
| `--window_size` | Max input sequence length | model config max_len |
| `--stride` | Stride for sliding window over sequences | None |
| `--random_seed` | Random seed for reproducibility | 3407 |
| `--num_workers` | DataLoader worker processes | 4 |
| `--no_pin_memory` | Disable pin_memory (enabled by default) | (flag) |
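
The relationship between `--window_size` and `--stride` can be illustrated with a short sketch. It assumes that a stride of `None` falls back to non-overlapping windows (stride equal to the window size) and that a trailing remainder shorter than one window is dropped; the actual data loader may behave differently. The helper `window_starts` is hypothetical.

```python
from typing import List, Optional


def window_starts(seq_len: int, window_size: int, stride: Optional[int] = None) -> List[int]:
    """Start offsets of fixed-length windows over a token sequence.

    Assumptions: stride=None means stride == window_size (no overlap),
    and a tail shorter than window_size is dropped.
    """
    if stride is None:
        stride = window_size
    if seq_len <= window_size:
        # A short sequence yields a single (possibly padded) window.
        return [0]
    return list(range(0, seq_len - window_size + 1, stride))


# Overlapping windows: each window shares half of its tokens with the next.
print(window_starts(seq_len=10, window_size=4, stride=2))
# Non-overlapping windows (stride defaults to window_size).
print(window_starts(seq_len=10, window_size=4))
```

A smaller stride produces more training windows from the same data at the cost of repeated tokens.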
Checkpoint & Resume
| Parameter | Description | Default |
| --- | --- | --- |
| `--ckpt_interval` | Iterations between checkpoints | 5000 |
| `--ckpt_dir` | Checkpoint save directory | checkpoint |
| `--start_epoch` | Resume from epoch (0 = from scratch) | 0 |
| `--start_batch` | Resume from batch iteration | 0 |
Distributed Training
| Parameter | Description | Default |
| --- | --- | --- |
| `--nprocs` | Number of GPUs / processes | 1 |
| `--parallel_mode` | Parallel strategy (none, ddp, or fsdp) | none |
| `--device_type` | Device type | cuda |
| `--start_method` | Multiprocessing start method (spawn, fork, forkserver) | spawn |
Strategy-specific
| Parameter | Description | Default | Used by |
| --- | --- | --- | --- |
| `--dpo_beta` | DPO beta value | 0.1 | dpo |
| `--label_smoothing` | Label smoothing for cross-entropy loss | 0.05 | seq, sft |
| `--group_size` | GRPO group size | 4 | grpo |
| `--grpo_clip_eps` | GRPO clipping epsilon | 0.2 | grpo |
| `--grpo_kl_coef` | GRPO KL penalty coefficient | 0.01 | grpo |
| `--grpo_sync_interval` | GRPO ref_model sync interval (steps) | 200 | grpo |
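
For readers unfamiliar with where these values enter the objectives, the sketch below shows the standard textbook forms: the DPO loss scaled by `--dpo_beta`, group-normalized advantages over `--group_size` samples, and the clipped surrogate bounded by `--grpo_clip_eps`. These follow the commonly published formulations and are not taken from this project's code; all function names are hypothetical, and the KL penalty term is omitted.

```python
import math
from typing import List


def dpo_loss(policy_chosen_logp: float, policy_rejected_logp: float,
             ref_chosen_logp: float, ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Per-pair DPO loss: -log(sigmoid(beta * margin))."""
    margin = ((policy_chosen_logp - ref_chosen_logp)
              - (policy_rejected_logp - ref_rejected_logp))
    x = beta * margin
    # Numerically stable softplus(-x), which equals -log(sigmoid(x)).
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))


def group_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of sampled responses."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]


def clipped_term(ratio: float, advantage: float, clip_eps: float = 0.2) -> float:
    """PPO-style clipped surrogate for one sample (to be maximized)."""
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    return min(ratio * advantage, clipped_ratio * advantage)


# A larger beta penalizes a negative preference margin more strongly.
print(dpo_loss(-12.0, -10.0, -11.0, -11.0, beta=0.1))
# One group of four rewards, matching the default --group_size.
print(group_advantages([1.0, 0.0, 0.0, 1.0]))
# A ratio above 1 + clip_eps gains nothing extra for a positive advantage.
print(clipped_term(1.5, 1.0, clip_eps=0.2))
```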
Usage Example
export CUDA_VISIBLE_DEVICES=0,1,2,3
nohup python scripts/tools/train.py \
--nprocs=4 \
--train_type=seq \
--data_root_path=/path/to/dataset \
--param_path=/path/to/model \
--batch_per_device=4 \
--grad_accum_steps=8 \
--warmup_ratio=0.05 \
--max_lr=1e-4 \
--max_grad_norm=1.0 \
--adamw_beta1=0.9 \
--adamw_beta2=0.95 \
--adamw_weight_decay=0.01 \
--window_size=2048 \
--ckpt_interval=10000 \
--ckpt_dir=./checkpoint \
--random_seed=3407 \
--label_smoothing=0.05 \
> out.log 2> err.log &
Last updated: 2026-05-24