modelscope · hjh0119 · Jun 23, 2026 · Jun 25, 2026
diff --git a/docs/source/Instruction/Command-line-parameters.md b/docs/source/Instruction/Command-line-parameters.md
@@ -576,25 +576,26 @@ RLHF参数继承于[训练参数](#训练参数)。
 - lmbda: 默认为0.5。该参数在GKD中使用。控制学生数据比例的 lambda 参数（即策略内学生生成输出所占的比例）。若lmbda为0，则不使用学生生成数据。
 - sft_alpha: 默认为0。控制GKD中加入sft_loss的权重。最后的loss为`gkd_loss + sft_alpha * sft_loss`。
 - gkd_logits_topk: 使用 Top-K logits 计算 KL 散度，默认为 None（即使用完整词表计算）。设置该参数可有效降低训练显存峰值；当配置 teacher_model_server 时，此参数为必填项。详见[GKD 文档](./GKD.md#top-k-kl-计算)。
-- offload_teacher_model: 卸载教师模型以节约显存，只在采样/计算logps时加载，默认为False。
 - truncation_strategy: 用于处理输入长度超过 max_length 的样本，支持 delete 和 left 两种策略，分别表示删除该样本和从左侧裁剪。默认值为 left。若使用 delete 策略，被删除的超长样本或编码失败的样本将在原数据集中通过重采样进行替换。
 - log_completions: 是否记录训练中的模型生成内容，搭配 `--report_to wandb/swanlab` 使用。默认为False。
   - 提示：若没有设置`--report_to wandb/swanlab`，则会在checkpoint中创建`completions.jsonl`来存储生成内容。
   - 仅记录 vLLM 采样结果。
 
 #### Reward/Teacher模型参数
-reward模型参数将在PPO、GRPO中使用。
+reward模型参数将在PPO、GRPO中使用；teacher模型参数在GKD与GRPO中使用。
 
 - reward_model: 默认为None。
 - reward_adapters: 默认为`[]`。
 - reward_model_type: 默认为None。
 - reward_model_revision: 默认为None。
-- teacher_model: 默认为None。rlhf_type为'gkd'时需传入此参数。
+- teacher_model: 默认为None。
 - teacher_adapters: 默认为`[]`。
 - teacher_model_type: 默认为None。
 - teacher_model_revision: 默认为None。
-- teacher_model_server: 教师模型服务地址, 如：`http://localhost:8000`, 使用`vllm serve`部署的服务端计算top-k-logps。
+- teacher_model_server: 教师模型服务地址, 如：`http://localhost:8000`, 使用`swift deploy`部署的服务端计算logps。
 - teacher_deepspeed: 同 deepspeed 参数，控制 teacher model 的 deepspeed 配置，默认使用训练模型的 deepspeed 配置。
+- offload_teacher_model: 卸载教师模型以节约显存，只在采样/计算logps时加载，仅在设置teacher_model时生效。默认为False。
+
 
 #### PPO参数
 
@@ -675,6 +676,7 @@ reward模型参数将在PPO、GRPO中使用。
 - scale_rewards: 指定奖励的缩放策略。可选值包括 `group`（按组内标准差缩放）、`batch`（按整个批次的标准差缩放）、`none`（不进行缩放）、`gdpo`（对每个奖励函数分别进行组内归一化后加权聚合，参考 [GDPO 论文](https://arxiv.org/abs/2601.05242)）。默认值与 `advantage_estimator` 绑定：`grpo` 对应 `group`，`rloo` 对应 `none`，`reinforce_plus_plus` 对应 `batch`。
   - 注意：`gdpo` 模式不支持 `kl_in_reward=True`，若同时设置会自动将 `kl_in_reward` 设为 `False`。
   - GDPO 适用于多奖励优化场景：当使用多个奖励函数时，GDPO 会对每个奖励函数分别在组内进行标准化（减均值、除标准差），然后使用 `reward_weights` 进行加权求和，最后再进行批次级别的标准化。这种方式可以更好地保留各个奖励的相对差异，避免不同奖励组合坍塌成相同的 advantage 值。
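+- teacher_kl_coef: OPD-RL中teacher_kl的系数，即 `adv_t = base_adv - teacher_kl_coef * teacher_kl`，其中 `teacher_kl = student_logp - teacher_logp`（在学生采样token上的k1 reverse-KL估计，教师更偏好的token获得正的advantage）。默认为 1.0。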
 - sync_ref_model: 是否定期同步ref_model，默认为False。
   - ref_model_mixup_alpha: 控制在更新过程中model和先前ref_model之间的混合。更新公式为 $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$。默认为0.6。
   - ref_model_sync_steps: 同步频率，默认为512。

diff --git a/docs/source/Megatron-SWIFT/Command-line-parameters.md b/docs/source/Megatron-SWIFT/Command-line-parameters.md
@@ -397,23 +397,31 @@ Megatron训练参数继承自Megatron参数和基本参数（**与ms-swift共用
 - log_rollout_offpolicy_metrics: 当 `rollout_importance_sampling_mode` 未设置时，是否记录训推不一致诊断指标（KL、PPL、χ²等）。当设置了 `rollout_importance_sampling_mode` 时，指标会自动记录。默认为False。
 - off_policy_sequence_mask_delta: Off-Policy Sequence Masking 阈值，来自 DeepSeek-V3.2 论文。当设置此值时，会计算每个序列的 `mean(old_policy_logps - policy_logps)`，若该值大于阈值且该序列的优势为负，则 mask 掉该序列不参与损失计算。默认为None，不启用。具体参考[文档](../Instruction/GRPO/AdvancedResearch/training_inference_mismatch.md#off-policy-sequence-masking)。
 - router_replay_mode: 路由重放模式，可选项为`disabled`、`R2`、`R3`。默认为disabled，不启用路由重放。
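+- teacher_kl_coef: OPD-RL中teacher_kl的系数，即 `adv_t = base_adv - teacher_kl_coef * teacher_kl`，其中 `teacher_kl = student_logp - teacher_logp`（在学生采样token上的k1 reverse-KL估计，教师更偏好的token获得正的advantage）。默认为 1.0。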
 
 内置奖励函数参数参考[文档](../Instruction/Command-line-parameters.md#奖励函数参数)
 
 ### GKD参数
-- teacher_model: 教师模型的路径或模型 ID，必需参数。
-- teacher_model_type: 教师模型类型，默认为None，自动检测。
-- teacher_model_revision: 教师模型版本，默认为None。
+
 - beta: JSD 散度插值系数。0.0 代表 Forward KL，0.5 代表对称 JSD，1.0 代表 Reverse KL。默认为0.5。
 - lmbda: On-Policy 学习触发概率。0.0 代表纯 Off-Policy，1.0 代表纯 On-Policy。默认为0.5。
 - temperature: 用于采样和损失计算的温度参数。默认为0.9。
-- offload_teacher_model: 是否将教师模型卸载到 CPU 以节省 GPU 显存。默认为False。
 - sft_alpha: SFT 损失的混合系数，`loss = jsd_loss + sft_alpha * sft_loss`。当使用数据集响应（Off-Policy）时生效。默认为0。
 - max_completion_length: 生成时的最大 token 数。默认为512。
 - vllm_mode: 同 GRPO 参数，用于 On-Policy 生成。colocate 模式下在程序内部署 vLLM。
   - 注意：On-Policy 生成需要启用 vLLM（`--use_vllm true --vllm_mode colocate/server`）。
   - 当 `lmbda > 0` 但未启用 vLLM 时，将自动回退到 Off-Policy 模式。
 
+### teacher 参数
+在GKD与GRPO中使用。
+
+- teacher_model: 教师模型的路径或模型 ID。GKD 中为必需参数；在 GRPO 中为可选参数，设置后启用 OPD-RL。
+- teacher_model_type: 教师模型类型，默认为None，自动检测。
+- teacher_model_revision: 教师模型版本，默认为None。
+- teacher_model_server: 教师模型服务地址, 如：`http://localhost:8000`, 使用`swift deploy`部署的服务端计算logps。
+- offload_teacher_model: 是否将教师模型卸载到 CPU 以节省 GPU 显存。默认为False。
+
+
 ## 导出参数
 这里介绍`megatron export`的参数，若要使用`swift export`导出命令，请参考[ms-swift命令行参数文档](../Instruction/Command-line-parameters.md#导出参数)。`megatron export`相比`swift export`，支持分布式和多机导出。Megatron导出参数继承自Megatron参数和基本参数。
 - 🔥to_mcore: HF格式权重转成Megatron格式。默认为False。

diff --git a/docs/source_en/Instruction/Command-line-parameters.md b/docs/source_en/Instruction/Command-line-parameters.md
@@ -587,28 +587,27 @@ RLHF arguments inherit from the [training arguments](#training-arguments).
 #### GKD Arguments
 - lmbda: Default is 0.5. This parameter is used in GKD. It controls the lambda parameter for the proportion of student data (i.e., the proportion of student-generated outputs within the strategy). If lmbda is 0, student-generated data is not used.
 - sft_alpha: The default value is 0. It controls the weight of sft_loss added in GKD. The final loss is `gkd_loss + sft_alpha * sft_loss`.
-  - Note: You can perform inference on the dataset using the teacher model in advance (accelerated by inference engines such as vLLM, SGLang, or lmdeploy), and use the teacher-generated data directly as dataset.
 - gkd_logits_topk: Use Top-K logits to compute KL divergence. Defaults to None, which means the full vocabulary is used. Setting this parameter can effectively reduce peak GPU memory usage during training. This parameter is required when teacher_model_server is configured. See [GKD documentation](./GKD.md#top-k-kl-computation) for more details.
-- offload_teacher_model: Whether to offload the teacher model to save GPU memory. If set to True, the teacher model will be loaded only during generate/logps computation. Default: False.
 - truncation_strategy: The method to handle inputs exceeding `max_length`. Supported values are `delete` and `left`, representing deletion and left-side truncation respectively. The default is `left`. With the delete strategy, over-long or encoding-failed samples are discarded, and new samples are resampled from the original dataset to maintain the intended batch size.
 - log_completions: Whether to log the model-generated content during training, to be used in conjunction with `--report_to wandb/swanlab`, default is False.
   - Note: If `--report_to wandb/swanlab` is not set, a `completions.jsonl` will be created in the checkpoint to store the generated content.
   - Log vLLM rollout results only.
 
 #### Reward/Teacher Model Parameters
 
-The reward model parameters will be used in PPO and GRPO. The teacher model parameters will be used in GKD.
+The reward model parameters will be used in PPO and GRPO. The teacher model parameters will be used in GKD and GRPO.
 
 - reward_model: Default is None.
 - reward_adapters: Default is `[]`.
 - reward_model_type: Default is None.
 - reward_model_revision: Default is None.
-- teacher_model: Default is None. This parameter must be provided when `rlhf_type` is `'gkd'`.
+- teacher_model: Default is None.
 - teacher_adapters: Default is `[]`.
 - teacher_model_type: Default is None.
 - teacher_model_revision: Default is None.
-- teacher_model_server: The address of the teacher model server, e.g. `http://localhost:8000`. This should be a service deployed via `vllm serve` for computing top-k log probabilities.
+- teacher_model_server: The address of the teacher model server, e.g. `http://localhost:8000`. The server should be deployed with `swift deploy` and is used to compute the teacher logps.
 - teacher_deepspeed: Same as the deepspeed parameter, controls the DeepSpeed configuration for the teacher model. By default, uses the DeepSpeed configuration of the training model.
+- offload_teacher_model: Whether to offload the teacher model to save GPU memory. Loaded only during sampling/logps computation. Only effective when `teacher_model` is set. Default is False.
 
 
 #### PPO Arguments
@@ -694,6 +693,7 @@ The hyperparameters for the reward function can be found in the [Built-in Reward
 - scale_rewards: Specifies the reward scaling strategy. Options: `group` (scale by intra-group std), `batch` (scale by batch-wide std), `none` (no scaling), `gdpo` (normalize each reward function separately within groups before weighted aggregation, see [GDPO paper](https://arxiv.org/abs/2601.05242)). The default is bound to `advantage_estimator`: `group` for `grpo`, `none` for `rloo`, and `batch` for `reinforce_plus_plus`.
   - Note: `gdpo` mode does not support `kl_in_reward=True`. If both are set, `kl_in_reward` will be automatically set to `False`.
   - GDPO is designed for multi-reward optimization: When using multiple reward functions, GDPO normalizes each reward function separately within groups (subtract mean, divide by std), then performs weighted aggregation using `reward_weights`, and finally applies batch-level normalization. This approach better preserves the relative differences between rewards and prevents different reward combinations from collapsing into identical advantage values.
+- teacher_kl_coef: Coefficient for the teacher signal in OPD-RL, i.e. `adv_t = base_adv + teacher_kl_coef * (teacher_logp - student_logp)` (the signed k1 reverse-KL reward; teacher-preferred tokens get a positive advantage). Default is 1.0.
 - sync_ref_model: Whether to synchronize the reference model. Default is False.
   - ref_model_mixup_alpha: The Parameter controls the mix between the current policy and the previous reference policy during updates. The reference policy is updated according to the equation: $π_{ref} = α * π_θ + (1 - α) * π_{ref_{prev}}$. Default is 0.6.
   - ref_model_sync_steps：The parameter determines how frequently the current policy is synchronized with the reference policy. Default is 512.
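Reviewer note on `teacher_kl_coef`: the two ways the formula is written in this PR (`base_adv - teacher_kl_coef * teacher_kl` and `base_adv + teacher_kl_coef * (teacher_logp - student_logp)`) agree when `teacher_kl` is read as the per-token k1 estimate `student_logp - teacher_logp`. The following is a minimal sketch of that arithmetic only. The function name `opd_rl_advantage` is hypothetical and is not part of ms-swift, and the real trainer may additionally apply masking or normalization that is not shown here.

```python
from typing import List, Sequence


def opd_rl_advantage(
    base_adv: Sequence[float],
    teacher_logp: Sequence[float],
    student_logp: Sequence[float],
    teacher_kl_coef: float = 1.0,
) -> List[float]:
    """Per-token advantage as documented for teacher_kl_coef (illustrative only).

    base_adv:     advantage from the GRPO estimator (all zeros without reward_funcs)
    teacher_logp: teacher log-probability of each student-sampled token
    student_logp: student log-probability of the same tokens
    """
    if not (len(base_adv) == len(teacher_logp) == len(student_logp)):
        raise ValueError("all inputs must have one entry per sampled token")
    # Signed log-ratio: positive where the teacher prefers the sampled token.
    return [
        a + teacher_kl_coef * (t - s)
        for a, t, s in zip(base_adv, teacher_logp, student_logp)
    ]


# Pure distillation: no task reward, so base_adv is zero everywhere.
adv = opd_rl_advantage(
    base_adv=[0.0, 0.0],
    teacher_logp=[-0.5, -2.0],
    student_logp=[-1.0, -1.0],
)
print(adv)  # [0.5, -1.0]
```

The first token is more likely under the teacher than under the student, so it receives a positive advantage; the second is less likely under the teacher and is pushed down. Setting `teacher_kl_coef` to 0 leaves the base GRPO advantage unchanged.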

diff --git a/docs/source_en/Megatron-SWIFT/Command-line-parameters.md b/docs/source_en/Megatron-SWIFT/Command-line-parameters.md
@@ -420,24 +420,31 @@ In addition to inheriting the training parameters, the following parameters are
 - log_rollout_offpolicy_metrics: Whether to log training-inference mismatch diagnostic metrics (KL, PPL, χ², etc.) when `rollout_importance_sampling_mode` is not set. When `rollout_importance_sampling_mode` is set, metrics are always logged. Default is False.
 - off_policy_sequence_mask_delta: Off-Policy Sequence Masking threshold from [DeepSeek-V3.2 paper](https://arxiv.org/abs/2512.02556). When set, computes `mean(old_policy_logps - policy_logps)` for each sequence. If this value exceeds the threshold AND the sequence has negative advantage, the sequence is masked out from loss computation. For details, refer to the [documentation](../Instruction/GRPO/AdvancedResearch/training_inference_mismatch.md#off-policy-sequence-masking).
 - router_replay_mode: Router replay mode. Options are `disabled`,`R2`,`R3`. Default is disabled.
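+- teacher_kl_coef: Coefficient for the teacher KL term in OPD-RL, i.e. `adv_t = base_adv - teacher_kl_coef * teacher_kl`, where `teacher_kl = student_logp - teacher_logp` is the per-token k1 reverse-KL estimate on the student-sampled tokens (teacher-preferred tokens get a positive advantage). Default is 1.0.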
 
 Built-in reward function parameters refer to the [documentation](../Instruction/Command-line-parameters.md#reward-function-parameters).
 
 ### GKD Parameters
 
-- teacher_model: Path or model ID of the teacher model. Required.
-- teacher_model_type: Teacher model type. Default is None, auto-detected.
-- teacher_model_revision: Teacher model version. Default is None.
 - beta: JSD divergence interpolation coefficient. 0.0 means Forward KL, 0.5 means symmetric JSD, 1.0 means Reverse KL. Default is 0.5.
 - lmbda: On-Policy learning probability. 0.0 means pure Off-Policy, 1.0 means pure On-Policy. Default is 0.5.
 - temperature: Temperature for sampling and loss computation. Default is 0.9.
-- offload_teacher_model: Whether to offload teacher model to CPU to save GPU memory. Default is False.
 - sft_alpha: Mixing coefficient for SFT loss, `loss = jsd_loss + sft_alpha * sft_loss`. Takes effect when using dataset responses (Off-Policy). Default is 0.
 - max_completion_length: Maximum tokens for generation. Default is 512.
 - vllm_mode: Same as GRPO parameter, used for On-Policy generation. Colocate mode deploys vLLM within the program.
   - Note: On-Policy generation requires vLLM (`--use_vllm true --vllm_mode colocate/server`).
   - When `lmbda > 0` but vLLM is not enabled, it will automatically fall back to Off-Policy mode.
 
+### Teacher Parameters
+
+Used in GKD and GRPO.
+
+- teacher_model: Path or model ID of the teacher model. Required for GKD; optional in GRPO, where setting it enables OPD-RL.
+- teacher_model_type: Teacher model type. Default is None, auto-detected.
+- teacher_model_revision: Teacher model version. Default is None.
+- teacher_model_server: The address of the teacher model server, e.g. `http://localhost:8000`. The server should be deployed with `swift deploy` and is used to compute the teacher logps.
+- offload_teacher_model: Whether to offload teacher model to CPU to save GPU memory. Default is False.
+
 ## Export Parameters
 
 This section introduces the parameters for `megatron export`. To use the `swift export` command for exporting, please refer to the [ms-swift Command Line Parameters Documentation](../Instruction/Command-line-parameters.md#export-arguments). Compared to `swift export`, `megatron export` supports distributed and multi-node exporting. Megatron export parameters inherit from Megatron parameters and basic parameters.

diff --git a/examples/megatron/rlhf/opd_rl/dense.sh b/examples/megatron/rlhf/opd_rl/dense.sh
@@ -0,0 +1,52 @@
+# Megatron On-Policy Distillation as RL (OPD-RL): the signed teacher log-ratio as a GRPO advantage.
+#
+# Same teacher as Megatron GKD (`--rlhf_type gkd`); the only change is `--rlhf_type grpo`.
+# OPD-RL keeps the GRPO policy-gradient pipeline (PG OPD) and injects the per-token signed
+# teacher log-ratio `teacher_logp - student_logp` on the student-sampled tokens as an
+# *advantage* (the k1 reverse-KL reward; teacher-preferred tokens get a positive advantage).
+# With no `--reward_funcs`, the teacher signal is the sole training signal (pure distillation);
+# add `--reward_funcs` to mix task reward with it. `--teacher_kl_coef` (default 1.0) scales it.
+CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
+NPROC_PER_NODE=8 \
+PYTORCH_CUDA_ALLOC_CONF='expandable_segments:True' \
+megatron rlhf \
+    --rlhf_type grpo \
+    --model Qwen/Qwen3-8B-Base \
+    --teacher_model Qwen/Qwen3-32B \
+    --tuner_type full \
+    --dataset AI-MO/NuminaMath-TIR#5000 \
+    --tensor_model_parallel_size 2 \
+    --pipeline_model_parallel_size 2 \
+    --context_parallel_size 2 \
+    --advantage_estimator grpo \
+    --beta 0.0 \
+    --torch_dtype bfloat16 \
+    --micro_batch_size 2 \
+    --global_batch_size 16 \
+    --num_generations 8 \
+    --steps_per_generation 4 \
+    --num_train_epochs 1 \
+    --lr 1e-6 \
+    --logging_steps 1 \
+    --max_length 8192 \
+    --max_completion_length 4096 \
+    --attention_backend flash \
+    --use_vllm true \
+    --vllm_mode colocate \
+    --vllm_gpu_memory_utilization 0.5 \
+    --vllm_tensor_parallel_size 1 \
+    --vllm_max_model_len 16384 \
+    --sleep_level 1 \
+    --offload_model true \
+    --offload_optimizer true \
+    --offload_teacher_model true \
+    --recompute_granularity selective \
+    --finetune \
+    --no_save_optim \
+    --no_save_rng \
+    --temperature 1.0 \
+    --padding_free true \
+    --sequence_parallel true \
+    --log_completions true \
+    --train_iters 200 \
+    --save_steps 1000
diff --git a/examples/ray/opd_rl/opd_rl_colocate.yaml b/examples/ray/opd_rl/opd_rl_colocate.yaml
@@ -0,0 +1,62 @@
+# Ray Megatron On-Policy Distillation as RL (OPD-RL): the signed teacher log-ratio as a GRPO advantage.
+#
+# Same colocated teacher as Ray GKD; the only change is `rlhf_type: grpo`. OPD-RL keeps the
+# GRPO policy-gradient pipeline (PG OPD) and injects the per-token signed teacher log-ratio
+# `teacher_logp - student_logp` on the student-sampled tokens as an advantage (the k1 reverse-KL
+# reward; teacher-preferred tokens get a positive advantage). Omit `reward_funcs` for pure
+# distillation, or add them to mix task reward with it. `teacher_kl_coef` (default 1.0) scales it.
+# Note: the Ray pipeline supports a colocated `teacher_model`; `teacher_model_server` is not
+# wired yet.
+rlhf_type: grpo
+
+model: Qwen/Qwen3.5-2B
+teacher_model: Qwen/Qwen3.5-9B
+
+dataset: modelscope/gsm8k
+dataset_num_proc: 4
+split_dataset_ratio: 0
+
+micro_batch_size: 2
+global_batch_size: 16
+num_generations: 1
+steps_per_generation: 4
+num_train_epochs: 1
+logging_steps: 1
+seed: 42
+max_length: 2048
+max_completion_length: 2048
+padding_free: false
+cross_entropy_loss_fusion: true
+gradient_accumulation_fusion: false
+lr: 3e-5
+lr_warmup_fraction: 0.0
+attention_backend: flash
+temperature: 1.0
+beta: 0
+teacher_kl_coef: 1.0
+
+use_vllm: true
+
+colocate_groups: [[train, rollout]]
+offload_model: true
+offload_optimizer: true
+offload_teacher_model: true
+sleep_level: 1
+
+save_steps: 100
+no_save_optim: true
+no_save_rng: true
+
+train:
+  gpus: 4
+  tuner_type: lora
+  lora_rank: 8
+  lora_alpha: 32
+  tensor_model_parallel_size: 1
+  output_dir: megatron_output/ray_opd_rl_colocate
+
+rollout:
+  gpus: 4
+  vllm_tensor_parallel_size: 1
+  vllm_gpu_memory_utilization: 0.4
+  vllm_max_model_len: 4096
diff --git a/examples/ray/opd_rl/run.sh b/examples/ray/opd_rl/run.sh
@@ -0,0 +1,7 @@
+#!/bin/bash
+# Ray Megatron OPD-RL (On-Policy Distillation as RL) — colocate mode + colocated teacher.
+export CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES:-0,1,2,3}
+
+SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+
+megatron rlhf --use_ray true --config "$SCRIPT_DIR/opd_rl_colocate.yaml"