To address the bottleneck of accurately interpreting user intent in the current video generation community, we present Any2Caption, a novel framework for controllable video generation from any condition.
The key idea is to decouple the various condition interpretation steps from the video synthesis step.
By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera poses) into dense, structured captions that give backbone video generators better guidance.
We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning.
Comprehensive evaluations demonstrate that our system significantly improves controllability and video quality across various aspects of existing video generation models.
Figure 1: Any2Caption, an efficient and versatile framework for interpreting diverse conditions into structured captions, which can then be fed into any video generator to produce highly controllable videos.
To facilitate any-to-caption instruction tuning for Any2Caption, we construct Any2CapIns, a large-scale dataset that converts a concise user prompt and diverse non-text conditions into detailed, structured captions. Concretely, the dataset encompasses four main categories of conditions: depth maps, multiple identities, human poses, and camera poses. Through extensive manual labeling combined with automated annotation by GPT-4V, followed by rigorous human verification, we curate a total of 337K high-quality instances with 407K condition annotations; the short prompts and structured captions average 55 and 231 words, respectively.
Figure 2: The pipeline for constructing the Any2CapIns dataset involves three key steps: 1) data collection, 2) structured video caption generation, and 3) user-centric short prompt generation.
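To make the dataset description concrete, the following is a minimal sketch of how one Any2CapIns instance might be represented, together with a helper that computes the same kind of statistics quoted above (instance count, condition count, average word lengths). The class name `Any2CapInsRecord`, its field names, the file paths, and the `summarize` helper are all assumptions made for illustration; the released data format may differ. Note that 407K conditions over 337K instances implies that some instances carry more than one condition, which is why `conditions` is modeled as a mapping.

```python
from dataclasses import dataclass
from typing import Dict, List

# The four condition categories named in the dataset description.
CONDITION_TYPES = ("depth", "multi_identity", "human_pose", "camera_pose")


@dataclass
class Any2CapInsRecord:
    """Hypothetical layout of one instance; the field names are illustrative."""
    short_prompt: str            # concise, user-style prompt
    conditions: Dict[str, str]   # condition type -> file path (assumed storage)
    structured_caption: str      # detailed target caption


def word_count(text: str) -> int:
    """Count whitespace-separated words."""
    return len(text.split())


def summarize(records: List[Any2CapInsRecord]) -> Dict[str, float]:
    """Compute instance/condition counts and average text lengths."""
    n = len(records)
    if n == 0:
        raise ValueError("need at least one record")
    return {
        "instances": n,
        "conditions": sum(len(r.conditions) for r in records),
        "avg_prompt_words": sum(word_count(r.short_prompt) for r in records) / n,
        "avg_caption_words": sum(word_count(r.structured_caption) for r in records) / n,
    }


# Toy records with made-up paths, only to exercise the helper.
records = [
    Any2CapInsRecord(
        short_prompt="A man talks in a workshop.",
        conditions={"depth": "clip0/depth.mp4", "multi_identity": "clip0/ids/"},
        structured_caption="A man stands in a dim workshop and explains something passionately.",
    ),
    Any2CapInsRecord(
        short_prompt="A house with a red roof.",
        conditions={"camera_pose": "clip1/camera.json"},
        structured_caption="A large house with a red roof sits among lush greenery.",
    ),
]
stats = summarize(records)
print(stats)
```

On the two toy records this reports 2 instances and 3 conditions, mirroring on a small scale how the condition count can exceed the instance count.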
We propose Any2Caption, an MLLM-based universal condition interpreter that not only handles text, image, and video inputs but is also equipped with specialized modules for motion and camera pose inputs.
As illustrated in Fig. 3, Any2Caption takes any condition, or any combination of conditions, as input and produces a densely structured caption, which is then passed to any backend video generator for controllable, high-quality video production.
Figure 3: Architecture illustration of Any2Caption, where Qwen2-LLM serves as the backbone and is paired with text, image, video, motion, and camera encoders to produce structured captions.
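The decoupling described above can be sketched as a two-stage pipeline: an interpreter turns the short prompt and conditions into a caption, and a separate generator sees only that caption. This is an illustrative sketch rather than the authors' implementation. `UserInput`, `toy_interpreter`, `toy_generator`, and `controllable_generation` are hypothetical names, and both stages are trivial stand-ins for the real MLLM and video model.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict


@dataclass
class UserInput:
    """A short prompt plus any combination of non-text conditions."""
    short_prompt: str
    conditions: Dict[str, Any] = field(default_factory=dict)


def toy_interpreter(user_input: UserInput) -> str:
    """Stand-in for the MLLM interpreter (hypothetical, not the real model).

    It only records which conditions were supplied; a real interpreter
    would encode them and describe their content in the caption.
    """
    names = ", ".join(sorted(user_input.conditions)) or "none"
    return f"Dense Caption: {user_input.short_prompt} [conditions: {names}]"


def toy_generator(caption: str) -> Dict[str, str]:
    """Stand-in for a text-to-video backend; it echoes the caption it received."""
    return {"caption_used": caption}


def controllable_generation(
    user_input: UserInput,
    interpreter: Callable[[UserInput], str],
    generator: Callable[[str], Dict[str, str]],
) -> Dict[str, str]:
    """Stage 1 interprets the conditions; stage 2 sees only the caption."""
    caption = interpreter(user_input)
    return generator(caption)


request = UserInput(
    short_prompt="A woman walks in a modern room.",
    conditions={"camera_pose": [0.0, 0.1], "depth": "depth.mp4"},
)
result = controllable_generation(request, toy_interpreter, toy_generator)
print(result["caption_used"])
```

Because the generator's only input is a string, any text-to-video backend with that interface could be swapped in without changing the interpreter, which is the practical benefit of separating the two stages.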
To further showcase the model's capacity to understand and leverage input conditions, we directly feed the structured captions, derived from our model's interpretation, into downstream text-to-video generation systems (e.g., CogVideoX and Hunyuan). Here, we show the given short prompt and controllable conditions, along with the videos generated solely from the structured captions.
Short Prompt | Controls | Structured Caption | Generated Video | |
---|---|---|---|---|
(Hunyuan) A woman in red dress walks in a minimalist, modern room. She is holding two mugs and looks slightly displeased. The room has natural light. |
![]() |
![]() |
1. Dense Caption: A woman in a red dress is seen walking through a modern, well-lit interior space. She carries two white mugs—one in each hand—as she moves from a counter area with a coffee machine and pastries toward a windowed section of the room. Her expression suggests she may be in thought or reacting to something unseen. The environment features minimalist design, including white walls and a large black-and-white photo. Natural light streams through the windows, giving the space an open and airy feel. |
![]() |
(Hunyuan) A young man carrying a messenger bag runs down a narrow, cobblestone street filled with sandbags and crates, suggesting a wartime. |
![]() |
1. Dense Caption: A young man is seen running down a narrow cobblestone street, flanked by sandbags and crates that evoke a wartime setting. Dressed in a brown coat and carrying a messenger bag, he moves quickly through a corridor of old European-style buildings. The camera follows his movements, capturing both his hurried pace and the tense atmosphere around him. The scene conveys a strong sense of urgency and determination, set against a quiet, war-ravaged backdrop. |
![]() |
|
(Hunyuan) A man is adjusting his cap and looking around occasionally. Surroundings include a suburban neighborhood with a brick house, a white van, and an American flag. The weather is sunny, with trees and a clear blue sky. The man seems slightly frustrated, talking to the camera with natural lighting. |
![]() |
![]() |
1. Dense Caption: A man dressed in a white shirt and a greenish baseball cap is seen standing outdoors in a quiet suburban neighborhood. He occasionally adjusts his cap and neck while looking around his surroundings. The footage captures him from a close-up angle, highlighting his facial expressions and upper body. In the background, a white van, a brick house, and an American flag are visible, suggesting a bright and sunny day. |
![]() |
(CogVideoX) A serene winter backyard with snow-covered ground and bare trees, revealing a blue shed with a white garage door and a doghouse. |
![]() |
![]() |
1. Dense Caption: The video presents a tranquil backyard scene set during winter. It opens with a view of a wooden deck in the foreground before the camera gradually pans to reveal a small blue shed with a white garage door, accompanied by a doghouse to its left. In the center of the frame, a red slide stands out against the snow-covered ground. Surrounding the area are bare trees, reinforcing the cold and stillness of the season. The scene is calm and static, with no visible movement of people or animals, highlighting the peacefulness of a quiet winter day. |
![]() |
(CogVideoX) A man gestures while the woman listens. They sit in a sunny park. The camera captures close-up shots of their heads and shoulders. |
![]() |
1. Dense Caption: Two individuals are seated outdoors on a sunny day, engaged in a relaxed conversation. The woman, dressed in a blue and white patterned outfit and wearing a wide-brimmed hat, sits on the left, while the man, in a dark shirt and sunglasses, is on the right. They appear to be in a tranquil outdoor environment, such as a park or garden, with green grass and trees surrounding them. As they talk, the man occasionally gestures, and the woman listens with interest. |
![]() |
Given multiple identities and a short prompt, we compare the videos generated with the short prompt, the short prompt plus a condition caption, and the structured prompt.
Short Prompt | Controls | Generated Video w/ short prompt | Generated Video w/ short prompt + Condition Caption | Generated Video w/ structured prompt |
---|---|---|---|---|
A man in a dark suit stands outdoors, initially looking distressed. The camera captures a close-up of his face, showing a bandage on his forehead. Suddenly, he shifts his stance urgently, turns his head to the side, and raises a handgun with determination. The setting is plain, possibly showing a clear sky, and the lighting suggests it's daytime. | ![]() |
![]() |
![]() |
![]() |
A young girl wearing a school uniform and a young man in casual clothes are walking side by side along a dimly lit concrete wall at night. The girl walks on the left while the boy rides a bicycle on the right. The background is urban and gritty, with warm, moody lighting. The camera follows them closely, capturing a medium close-up shot of their upper bodies from different angles as they move. The scene has a nostalgic and contemplative atmosphere. |
![]() |
![]() |
![]() |
![]() |
A woman in a skirt dances on the snow with her hair flying. |
![]() |
|||
A whale is floating in the girl's palm, and the camera gradually zooms in. |
![]() |
Given a camera trajectory and a short prompt, we compare the videos generated with the short prompt and with the structured prompt.
Short Prompt | Controls | Generated Video w/ short prompt | Generated Video w/ structured prompt |
---|---|---|---|
A serene video of a large house with a red roof and a spacious porch, surrounded by lush greenery. A peaceful countryside setting with vibrant colors and a tranquil atmosphere. |
![]() |
![]() |
![]() |
A well-lit dining and living room with elegant and classic decor. The dining table is surrounded by chairs and has a chandelier above it. There's a wooden cabinet against the wall. The background features a hallway with a staircase and another dining area visible. The decor includes wooden furniture and framed pictures on the walls. |
![]() |
![]() |
![]() |
The scene is bathed in bright sunlight, emphasizing the warm and inviting atmosphere. A modern house with large windows and a balcony is showcased. Potted plants accent the architectural details. Distant mountains frame the view. Lush greenery surrounds the scene. The sky is a clear blue, dotted with scattered clouds. A dreamy lens flare effect adds to the serene quality. The overall ambiance is tranquil and picturesque. |
![]() |
![]() |
![]() |
Given a depth sequence, multiple identities, and a short prompt, we compare the videos generated with the short prompt and with the structured prompt.
Short Prompt | Controls | Generated Video w/ short prompt | Generated Video w/ structured prompt | |
---|---|---|---|---|
Young woman with cat ears holding a white cat in a sunlit meadow. Blue butterflies flutter around. Gentle caresses and a serene, magical atmosphere. Trees cast a warm glow. Fixed camera at eye level capturing upper body. Whimsical, enchanting style. |
![]() |
![]() |
![]() |
![]() |
A young girl dances in a bright, colorful room. A fluffy dog joyfully mimics her moves. The room features unique, colorful furniture with large windows letting in natural light. The dance is lively and expressive. The atmosphere is cheerful and vibrant. The camera captures full body movements at eye level. |
![]() |
![]() |
![]() |
![]() |
A man in a blue sweater and dark jacket looks distressed and scared. He is outdoors in a residential area with trees and buildings in the background. The camera pans smoothly, focusing on his upper body and face from various angles during daylight. The scene captures his intense emotions and movements realistically. |
![]() |
![]() |
![]() |
![]() |
A man in a dark, workshop-like room, explaining something passionately. Various tools and shoes are on the shelves and workbench. The lighting is dim, creating a serious atmosphere. The camera pans right and then moves forward, focusing on him from a medium close-up shot. |
![]() |
![]() |
![]() |
![]() |
A ballerina in a black dress with a feathered bodice and a male dancer in a dark, patterned outfit. Detailed stone wall backdrop with ornate metalwork, dim lighting. Emotional and synchronized dance with spins, lifts, arm movements. Dramatic and gothic atmosphere. |
![]() |
![]() |
![]() |
![]() |
Given a style image and a short prompt, we compare the videos generated with the short prompt and with the structured prompt.
Short Prompt | Controls | Generated Video w/ short prompt | Generated Video w/ structured prompt |
---|---|---|---|
A lighthouse is beaming across choppy waters. |
![]() |
![]() |
![]() |
A little girl is reading a book in the beautiful garden. |
![]() |
![]() |
![]() |
A street performer playing the guitar. |
![]() |
![]() |
![]() |
@inproceedings{wu2025Any2Caption,
title={Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation},
author={Shengqiong Wu and Weicai Ye and Jiahao Wang and Quande Liu and Xintao Wang and Pengfei Wan and Di Zhang and Kun Gai and Shuicheng Yan and Hao Fei and Tat-Seng Chua},
booktitle={arxiv},
year={2025}
}