A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

1Kuaishou Technology   2National University of Singapore  
(*Work done during internship at Kuaishou Technology. Correspondence)

TL;DR: We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators.

Abstract

Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent–output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions; (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent.

Method

ReaDe is a multimodal LLM initialized from Qwen2.5-Omni. Various user-provided conditions are processed by their corresponding encoders (text, visual, video, audio, and camera), and the extracted features are integrated by the Qwen-LLM to perform reasoning. The model outputs a deeply interpreted dense prompt for the downstream video generator.
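
To make this pipeline concrete, below is a minimal sketch of how condition-specific encoders could feed a shared backbone. The class names, feature dimensions, and the small stand-in transformer backbone are illustrative assumptions, not the released ReaDe implementation.

import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Stand-in for a modality-specific encoder (text, visual, video, audio, or camera)."""
    def __init__(self, in_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)  # (batch, tokens, hidden_dim)

class InstructionInterpreter(nn.Module):
    """Routes each user-provided condition through its encoder and hands the
    concatenated token sequence to the backbone LLM (Qwen2.5-Omni in ReaDe;
    a small transformer encoder stands in for it here)."""
    def __init__(self, modal_dims: dict, hidden_dim: int = 1024):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: ConditionEncoder(dim, hidden_dim) for name, dim in modal_dims.items()}
        )
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, conditions: dict) -> torch.Tensor:
        tokens = [self.encoders[name](feat) for name, feat in conditions.items()]
        fused = torch.cat(tokens, dim=1)  # merge condition tokens along the sequence axis
        return self.backbone(fused)       # joint reasoning over all conditions

# Usage with fabricated feature shapes, purely for illustration.
model = InstructionInterpreter({"text": 768, "visual": 1024, "camera": 12})
fused_states = model({
    "text": torch.randn(1, 16, 768),
    "visual": torch.randn(1, 32, 1024),
    "camera": torch.randn(1, 8, 12),
})
print(fused_states.shape)  # torch.Size([1, 56, 1024])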

Figure 1: Overall illustration of the multimodal encoding framework.

Figure 2: Text-only prompt optimizers and data-hungry multimodal methods (e.g., Any2Caption) remain brittle, performing poorly on reasoning-intensive and unseen instructions.

Inspired by Chain-of-Thought (CoT) prompting, ReaDe emulates a human-like reasoning process: it systematically decomposes the initial prompt into its core requirements, resolves cross-modal misalignments and ambiguities, and enriches the prompt with explicit details to enable faithful, high-quality video generation. Technically, we propose a multi-dimensional feedback reinforcement learning framework comprising two stages:

  • In Stage 1, the interpreter is equipped with initial analytic parsing capabilities for instruction refinement, using curated, reasoning-augmented data that pairs user inputs with stepwise reasoning traces and gold dense, detailed captions (a minimal sketch of this format follows the list).
  • In Stage 2, we design a multi-dimensional feedback reward assigner to overcome the intrinsic difficulty of evaluating naturally styled captions, enabling stable, feedback-driven optimization that steers the model to infer user intent more accurately and to generate detailed captions suitable for controllable video generation.
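
As referenced in Stage 1, the following is a minimal, illustrative sketch of the reason-then-describe output format: a stepwise reasoning trace followed by a dense caption. The <think>/<answer> tags and the example content are assumptions for illustration, not the exact markup of the training data.

import re

RESPONSE = """<think>
The user asks for "a dog on a beach at sunset" and provides a reference image of a corgi.
Core requirements: subject = the corgi from the image, scene = beach, time = sunset.
Ambiguity: camera motion is unspecified; default to a slow left-to-right pan.
</think>
<answer>
A fluffy corgi trots along a wet sandy beach at golden-hour sunset; warm orange light
reflects off gentle waves while the camera pans slowly from left to right.
</answer>"""

def parse_reason_then_describe(text: str) -> dict:
    """Split a model response into its reasoning trace and the dense caption."""
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return {
        "reasoning": think.group(1).strip() if think else "",
        "dense_caption": answer.group(1).strip() if answer else text.strip(),
    }

parsed = parse_reason_then_describe(RESPONSE)
print(parsed["dense_caption"])  # the detailed prompt handed to the video generator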

Figure 3: Overview of the training framework for the Instruction Interpreter (ReaDe). (1) CoT-guided reasoning initialization via supervised fine-tuning on instruction–thinking–answer triples, and (2) reinforcement learning with a multi-dimensional reward assigner and optional video-quality feedback from a frozen video generator.
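
The sketch below illustrates one way the multi-dimensional feedback could be collapsed into a scalar RL reward. The dimension names, weights, and example scores are assumptions for illustration; in ReaDe the reward assigner is a learned component and the video-quality signal comes from a frozen video generator.

from dataclasses import dataclass

@dataclass
class RewardWeights:
    instruction_fidelity: float = 0.4  # does the caption cover every stated requirement?
    caption_accuracy: float = 0.4      # is the added detail consistent with the input conditions?
    video_quality: float = 0.2         # optional feedback from a frozen video generator

def aggregate_reward(scores: dict, w: RewardWeights = RewardWeights()) -> float:
    """Combine per-dimension scores (each in [0, 1]) into a single scalar reward."""
    return (
        w.instruction_fidelity * scores.get("instruction_fidelity", 0.0)
        + w.caption_accuracy * scores.get("caption_accuracy", 0.0)
        + w.video_quality * scores.get("video_quality", 0.0)
    )

# Example: per-dimension scores a judge/reward model might return for one rollout.
print(aggregate_reward({"instruction_fidelity": 0.9, "caption_accuracy": 0.8, "video_quality": 0.7}))  # 0.82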

BibTeX

@article{wu2025reason,
  title={A Reason-then-Describe Instruction Interpreter for Controllable Video Generation},
  author={Wu, Shengqiong and Ye, Weicai and Zhang, Yuanxing and Wang, Jiahao and Liu, Quande and Wang, Xintao and Wan, Pengfei and Gai, Kun and Fei, Hao and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2511.20563},
  year={2025}
}