A Reason-then-Describe Instruction Interpreter for Controllable Video Generation

1Kuaishou Technology   2National University of Singapore  
(*Work done during internship at Kuaishou Technology. Correspondence)

TL;DR: We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators.

Abstract

Diffusion Transformers have significantly improved video fidelity and temporal coherence; however, practical controllability remains limited. Concise, ambiguous, and compositionally complex user inputs contrast with the detailed prompts used in training, yielding an intent–output mismatch. We propose ReaDe, a universal, model-agnostic interpreter that converts raw instructions into precise, actionable specifications for downstream video generators. ReaDe follows a reason-then-describe paradigm: it first analyzes the user request to identify core requirements and resolve ambiguities, then produces detailed guidance that enables faithful, controllable generation. We train ReaDe via a two-stage optimization: (i) reasoning-augmented supervision imparts analytic parsing with stepwise traces and dense captions; (ii) a multi-dimensional reward assigner enables stable, feedback-driven refinement for natural-style captions. Experiments across single- and multi-condition scenarios show consistent gains in instruction fidelity, caption accuracy, and downstream video quality, with strong generalization to reasoning-intensive and unseen inputs. ReaDe offers a practical route to aligning controllable video generation with accurately interpreted user intent.

Method

ReaDe is a multimodal LLM initialized from Qwen2.5-Omni. Various user-provided conditions are processed by their corresponding encoders (text, visual, video, audio, and camera), and the extracted features are integrated by the Qwen-LLM to perform reasoning. The model outputs a deeply interpreted dense prompt for the downstream video generator.
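
To make this pipeline concrete, below is a minimal sketch of how condition-specific encoders could feed a shared backbone. The class names, feature dimensions, and the small stand-in transformer backbone are illustrative assumptions, not the released ReaDe implementation.

import torch
import torch.nn as nn

class ConditionEncoder(nn.Module):
    """Stand-in for a modality-specific encoder (text, visual, video, audio, or camera)."""
    def __init__(self, in_dim: int, hidden_dim: int = 1024):
        super().__init__()
        self.proj = nn.Linear(in_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)  # (batch, tokens, hidden_dim)

class InstructionInterpreter(nn.Module):
    """Routes each user-provided condition through its encoder and hands the
    concatenated token sequence to the backbone LLM (Qwen2.5-Omni in ReaDe;
    a small transformer encoder stands in for it here)."""
    def __init__(self, modal_dims: dict, hidden_dim: int = 1024):
        super().__init__()
        self.encoders = nn.ModuleDict(
            {name: ConditionEncoder(dim, hidden_dim) for name, dim in modal_dims.items()}
        )
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, conditions: dict) -> torch.Tensor:
        tokens = [self.encoders[name](feat) for name, feat in conditions.items()]
        fused = torch.cat(tokens, dim=1)  # merge condition tokens along the sequence axis
        return self.backbone(fused)       # joint reasoning over all conditions

# Usage with fabricated feature shapes, purely for illustration.
model = InstructionInterpreter({"text": 768, "visual": 1024, "camera": 12})
fused_states = model({
    "text": torch.randn(1, 16, 768),
    "visual": torch.randn(1, 32, 1024),
    "camera": torch.randn(1, 8, 12),
})
print(fused_states.shape)  # torch.Size([1, 56, 1024])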

Figure 1: Overall illustration of the multimodal encoding framework.

Figure 2: Text-only prompt optimizers and data-hungry multimodal methods (e.g., Any2Caption) remain brittle, performing poorly on reasoning-intensive and unseen instructions.

Inspired by Chain-of-Thought (CoT) prompting, ReaDe emulates a human-like reasoning process: it systematically decomposes the initial prompt into its core requirements, resolves cross-modal misalignments and ambiguities, and enriches the prompt with explicit details to enable faithful, high-quality video generation. Technically, we propose a multi-dimensional feedback reinforcement learning framework comprising two stages:

  • In Stage 1, the interpreter is equipped with initial analytic parsing capabilities for instruction refinement, using curated, reasoning-augmented data that pairs user inputs with stepwise reasoning traces and gold dense, detailed captions (a minimal sketch of this format follows the list).
  • In Stage 2, we design a multi-dimensional feedback reward assigner to overcome the intrinsic difficulty of evaluating naturally styled captions, enabling stable, feedback-driven optimization that steers the model to infer user intent more accurately and to generate detailed captions suitable for controllable video generation.
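
As referenced in Stage 1, the following is a minimal, illustrative sketch of the reason-then-describe output format: a stepwise reasoning trace followed by a dense caption. The <think>/<answer> tags and the example content are assumptions for illustration, not the exact markup of the training data.

import re

RESPONSE = """<think>
The user asks for "a dog on a beach at sunset" and provides a reference image of a corgi.
Core requirements: subject = the corgi from the image, scene = beach, time = sunset.
Ambiguity: camera motion is unspecified; default to a slow left-to-right pan.
</think>
<answer>
A fluffy corgi trots along a wet sandy beach at golden-hour sunset; warm orange light
reflects off gentle waves while the camera pans slowly from left to right.
</answer>"""

def parse_reason_then_describe(text: str) -> dict:
    """Split a model response into its reasoning trace and the dense caption."""
    think = re.search(r"<think>(.*?)</think>", text, re.S)
    answer = re.search(r"<answer>(.*?)</answer>", text, re.S)
    return {
        "reasoning": think.group(1).strip() if think else "",
        "dense_caption": answer.group(1).strip() if answer else text.strip(),
    }

parsed = parse_reason_then_describe(RESPONSE)
print(parsed["dense_caption"])  # the detailed prompt handed to the video generator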

Figure 3: Overview of the training framework for the Instruction Interpreter (ReaDe). (1) CoT-guided reasoning initialization via supervised fine-tuning on instruction–thinking–answer triples, and (2) reinforcement learning with a multi-dimensional reward assigner and optional video-quality feedback from a frozen video generator.
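
The sketch below illustrates one way the multi-dimensional feedback could be collapsed into a scalar RL reward. The dimension names, weights, and example scores are assumptions for illustration; in ReaDe the reward assigner is a learned component and the video-quality signal comes from a frozen video generator.

from dataclasses import dataclass

@dataclass
class RewardWeights:
    instruction_fidelity: float = 0.4  # does the caption cover every stated requirement?
    caption_accuracy: float = 0.4      # is the added detail consistent with the input conditions?
    video_quality: float = 0.2         # optional feedback from a frozen video generator

def aggregate_reward(scores: dict, w: RewardWeights = RewardWeights()) -> float:
    """Combine per-dimension scores (each in [0, 1]) into a single scalar reward."""
    return (
        w.instruction_fidelity * scores.get("instruction_fidelity", 0.0)
        + w.caption_accuracy * scores.get("caption_accuracy", 0.0)
        + w.video_quality * scores.get("video_quality", 0.0)
    )

# Example: per-dimension scores a judge/reward model might return for one rollout.
print(aggregate_reward({"instruction_fidelity": 0.9, "caption_accuracy": 0.8, "video_quality": 0.7}))  # 0.82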

BibTeX

@article{wu2025reason,
  title={A Reason-then-Describe Instruction Interpreter for Controllable Video Generation},
  author={Wu, Shengqiong and Ye, Weicai and Zhang, Yuanxing and Wang, Jiahao and Liu, Quande and Wang, Xintao and Wan, Pengfei and Gai, Kun and Fei, Hao and Chua, Tat-Seng},
  journal={arXiv preprint arXiv:2511.20563},
  year={2025}
}