Alibaba researchers have released a technical report presenting Qwen-Image-2.0-RL, a post-training pipeline that applies Reinforcement Learning from Human Feedback (RLHF) and On-Policy Distillation (OPD) to the Qwen-Image-2.0 diffusion model. The goal is to improve both the visual quality and the instruction-following capability of the image generation model.
To generate reliable reward signals, the team constructed task-specific composite reward models. By fine-tuning Vision-Language Models (VLMs) with a pointwise scoring paradigm and Chain-of-Thought (CoT) reasoning, these reward models evaluate alignment, aesthetics, and portrait fidelity for text-to-image (T2I) generation, while addressing instruction-following accuracy and identity preservation for image editing tasks.
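As a rough illustration of how per-dimension scores might be combined into one composite reward, here is a minimal sketch. It assumes each fine-tuned VLM judge already returns a pointwise score in [0, 1]. The `WEIGHTS` table, the dimension keys, and the `composite_reward` function are hypothetical names chosen for this example, and the weight values are placeholders rather than figures from the report.

```python
# Hypothetical weights per task; placeholder values, not taken from the report.
WEIGHTS = {
    "t2i": {"alignment": 0.4, "aesthetics": 0.3, "portrait_fidelity": 0.3},
    "edit": {"instruction_following": 0.6, "identity_preservation": 0.4},
}


def composite_reward(task, scores):
    """Combine pointwise scores (assumed to lie in [0, 1]) into one scalar.

    task   -- "t2i" or "edit"
    scores -- dict mapping each reward dimension to its pointwise score
    """
    weights = WEIGHTS[task]
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing scores for: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Weighted average, so the result stays on the same scale as the inputs.
    return sum(weights[name] * scores[name] for name in weights) / total_weight


if __name__ == "__main__":
    edit_scores = {"instruction_following": 0.5, "identity_preservation": 1.0}
    print(composite_reward("edit", edit_scores))
```

Normalizing by the total weight keeps the composite reward comparable across tasks even when the weights are later recalibrated.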
Building on this reward system, the authors developed a scalable RL training framework based on Group Relative Policy Optimization (GRPO). It incorporates a hybrid classifier-free guidance (CFG) strategy to preserve pre-trained knowledge, uses intra-group reward range filtering for prompt curation, and calibrates reward weights per category. To consolidate the specialized T2I and editing policies, they propose On-Policy Distillation as the final training stage, merging multiple teachers into a single student model through trajectory-level velocity matching.
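Two of the ingredients named above, group-relative advantages and intra-group reward range filtering, can be sketched in a few lines. This is an illustrative reading, not the authors' implementation. The function names, the `min_range` threshold, and the use of the population standard deviation are assumptions made for this example.

```python
from statistics import mean, pstdev


def reward_range(rewards):
    """Spread between the best and worst sample generated for one prompt."""
    return max(rewards) - min(rewards)


def keep_prompt(rewards, min_range=0.1):
    """Range filter: drop prompts whose samples all score nearly the same.

    Such groups give almost no learning signal, because every sample ends
    up with an advantage close to zero. min_range is a hypothetical value.
    """
    return reward_range(rewards) >= min_range


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: standardize each reward within its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    groups = {
        "prompt_a": [0.20, 0.55, 0.90, 0.35],  # informative spread
        "prompt_b": [0.70, 0.71, 0.70, 0.69],  # nearly constant rewards
    }
    for prompt, rewards in groups.items():
        if keep_prompt(rewards):
            print(prompt, group_relative_advantages(rewards))
        else:
            print(prompt, "filtered out")
```

In this sketch, each prompt's group of sampled images is scored, low-spread groups are discarded, and the remaining rewards are converted into advantages that sum to zero within the group.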
Extensive evaluation demonstrates that Qwen-Image-2.0-RL achieves an overall score of 57.84 on Qwen-Image-Bench (+2.61 over the base model). Additionally, it secured Elo ratings of 1193 in the text-to-image arena (+78) and 1349 in the image edit arena (+93), illustrating substantial gains in aesthetic quality and prompt adherence.
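The reported gains also imply the baseline numbers, and the Elo deltas can be read as head-to-head win expectations. The sketch below does that arithmetic. It assumes the arenas use the standard logistic Elo formula with a 400-point scale, which this summary does not confirm. The names `reported`, `baselines`, and `elo_expected_score` are hypothetical.

```python
def elo_expected_score(rating_gap, scale=400.0):
    """Expected score of the higher-rated side under the standard Elo model.

    The 400-point scale is an assumption; the arenas' exact rating scheme
    is not stated in the summary.
    """
    return 1.0 / (1.0 + 10.0 ** (-rating_gap / scale))


# (reported score, reported gain) pairs taken from the paragraph above.
reported = {
    "Qwen-Image-Bench overall": (57.84, 2.61),
    "T2I arena Elo": (1193, 78),
    "Edit arena Elo": (1349, 93),
}

# Implied baseline = reported score minus reported gain.
baselines = {
    name: round(score - gain, 2) for name, (score, gain) in reported.items()
}

if __name__ == "__main__":
    for name, value in baselines.items():
        print(f"{name}: implied baseline {value}")
    for gap in (78, 93):
        print(f"Elo gap {gap}: expected score {elo_expected_score(gap):.3f}")
```

Under that assumption, gaps of 78 and 93 Elo points correspond to the RL model being preferred in roughly 61% and 63% of direct comparisons, respectively.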
[AgentUpdate Depth Analysis] The adaptation of the GRPO framework to diffusion models is a notable milestone for the AI agent ecosystem. As agents move from text-only orchestrators to multimodal actuators that manipulate physical or digital environments, precise instruction-following in visual tasks becomes essential. Qwen-Image-2.0-RL narrows the gap between creative visual generation and precise execution. By consolidating specialized generation and editing policies through On-Policy Distillation, it offers a blueprint for multimodal agents that can iteratively edit UI mockups, generate consistent assets, or assist in physical reasoning tasks using closed-loop visual feedback.