Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation

¹Kuaishou Technology   ²National University of Singapore
(*Work done during internship at Kuaishou Technology. Correspondence)

Abstract

To address the bottleneck of accurately interpreting user intent in the current video generation community, we present Any2Caption, a novel framework for controllable video generation from any condition. The key idea is to decouple the various condition interpretation steps from the video synthesis step. By leveraging modern multimodal large language models (MLLMs), Any2Caption interprets diverse inputs (text, images, videos, and specialized cues such as region, motion, and camera poses) into dense, structured captions that give backbone video generators better guidance. We also introduce Any2CapIns, a large-scale dataset with 337K instances and 407K conditions for any-condition-to-caption instruction tuning. Comprehensive evaluations demonstrate significant improvements of our system in controllability and video quality across various aspects of existing video generation models.

Teaser

Figure 1: Any2Caption, an efficient and versatile framework for interpreting diverse conditions to structured captions, which then can be fed into any video generator to generate highly controllable videos.

Method

To facilitate the any-to-caption instruction tuning for Any2Caption, we construct Any2CapIns, a large-scale dataset that converts a concise user prompt and diverse non-text conditions into detailed, structured captions. Concretely, the dataset encompasses four main categories of conditions: depth maps, multiple identities, human poses, and camera poses. Through extensive manual labeling combined with automated annotation by GPT-4V, followed by rigorous human verification, we curate a total of 337K high-quality instances with 407K condition annotations; the short prompts and structured captions average 55 and 231 words, respectively.
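As a rough illustration of what one instance in such a dataset might look like, the sketch below pairs a short prompt and its conditions with a structured caption. The class name, field names, and condition identifiers are hypothetical and are not the released Any2CapIns schema; the six caption fields are taken from the example captions shown on this page.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# The six caption fields that appear in the examples on this page.
CAPTION_FIELDS = ("dense", "main_object", "background", "camera", "style", "action")

# The four condition categories named in the text (identifiers are hypothetical).
CONDITION_TYPES = ("depth", "multi_ids", "human_pose", "camera_pose")


@dataclass
class Any2CapInstance:
    """Hypothetical layout of one instance; NOT the official dataset schema."""
    short_prompt: str
    conditions: Dict[str, str]  # condition type -> path to the condition file
    structured_caption: Dict[str, str] = field(default_factory=dict)

    def validate(self) -> List[str]:
        """Return a list of problems; an empty list means the instance looks well-formed."""
        problems = []
        for name in self.conditions:
            if name not in CONDITION_TYPES:
                problems.append(f"unknown condition type: {name}")
        for name in CAPTION_FIELDS:
            if not self.structured_caption.get(name, "").strip():
                problems.append(f"missing caption field: {name}")
        return problems


def word_count(text: str) -> int:
    """Whitespace-based word count, e.g. for checking prompt and caption lengths."""
    return len(text.split())
```

A check like `validate()` is only one possible sanity test; the human verification step described above is a separate, manual process.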

Figure 2: The pipeline for constructing the Any2CapIns dataset involves three key steps: 1) data collection, 2) structured video caption generation, and 3) user-centric short prompt generation.

We propose Any2Caption, an MLLM-based universal condition interpreter that handles text, image, and video inputs and is also equipped with specialized modules for motion and camera pose inputs. As illustrated in Fig. 3, Any2Caption takes any condition, or any combination of conditions, as input and produces a densely structured caption, which is then passed to any backend video generator for controllable, high-quality video production.

Figure 3: Architecture illustration of Any2Caption, where Qwen2-LLM serves as the backbone and is paired with text, image, video, motion, and camera encoders to produce structured captions.
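The decoupling of condition interpretation from video synthesis can be sketched as a two-stage function. This is a minimal sketch under stated assumptions: the `Interpreter` and `Generator` signatures and the toy stand-ins are hypothetical and do not come from the Any2Caption codebase or from any video generator's API.

```python
from typing import Callable, Dict

# Hypothetical signatures, chosen only for illustration.
# (short prompt, conditions) -> structured caption text
Interpreter = Callable[[str, Dict[str, object]], str]
# text prompt -> generated video (any object)
Generator = Callable[[str], object]


def generate_video(
    short_prompt: str,
    conditions: Dict[str, object],
    interpreter: Interpreter,
    generator: Generator,
) -> object:
    """Two-stage flow: interpret the conditions into text, then synthesize from text only."""
    caption = interpreter(short_prompt, conditions)
    # The generator never sees the raw conditions, only the caption.
    return generator(caption)


def toy_interpreter(short_prompt: str, conditions: Dict[str, object]) -> str:
    """Stand-in for the MLLM: just records which conditions were provided."""
    names = ", ".join(sorted(conditions))
    return f"Dense Caption: {short_prompt} [conditions: {names}]"


def toy_generator(prompt: str) -> dict:
    """Stand-in for a text-to-video model: echoes the prompt it received."""
    return {"prompt": prompt, "frames": []}
```

Because the generator only consumes text, either stage can be swapped without changing the other, which is the point of the decoupled design.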

Visualizations


• Short Caption + Controls → (Structured Caption) → Video Generation

To further showcase the model's capacity to understand and leverage input conditions, we directly feed the structured captions derived from our model's interpretation into downstream text-to-video generation systems (e.g., CogVideoX and Hunyuan). Here, we show the given short prompt and control conditions, along with the videos generated solely from the structured captions.

Short Prompt | Controls | Structured Caption | Generated Video

(Hunyuan) A woman in red dress walks in a minimalist, modern room. She is holding two mugs and looks slightly displeased. The room has natural light.

1. Dense Caption: A woman in a red dress is seen walking through a modern, well-lit interior space. She carries two white mugs—one in each hand—as she moves from a counter area with a coffee machine and pastries toward a windowed section of the room. Her expression suggests she may be in thought or reacting to something unseen. The environment features minimalist design, including white walls and a large black-and-white photo. Natural light streams through the windows, giving the space an open and airy feel.
2. Main Object Caption: The main subject is a middle-aged woman with long, light brown hair and a fair complexion. She wears a sleeveless, form-fitting red dress with a v-neckline and subtle patterning. She carries two white mugs and appears slightly tense or concerned, as shown by her pursed lips and furrowed brows. Her gaze shifts across the room, as though she is searching or reacting to her surroundings.
3. Background Caption: The setting is a sleek, modern room with a minimalist aesthetic. It features clean white walls, a prominent black-and-white photograph of a building, and a countertop with a coffee machine and an arrangement of pastries. The space is brightly lit by natural light from large windows, and the scene transitions toward a section with a broader view of an adjacent office area.
4. Camera Caption: The camera tracks the woman as she walks to the left of the frame. It starts with a slightly low angle and maintains a medium close-up, focusing on her upper body. As she moves, the shot smoothly shifts to a full side profile, keeping her centered within the frame throughout the sequence.
5. Style Caption: The video adopts a clean, polished, and professional visual style. The environment and subject are sharply defined, with bright, even lighting that enhances the modern aesthetic.
6. Action Caption: A woman in a red skirt walked forward with two cups in her hands, and then turned to look at the right.

(Hunyuan) A young man carrying a messenger bag runs down a narrow, cobblestone street filled with sandbags and crates, suggesting a wartime.

1. Dense Caption: A young man is seen running down a narrow cobblestone street, flanked by sandbags and crates that evoke a wartime setting. Dressed in a brown coat and carrying a messenger bag, he moves quickly through a corridor of old European-style buildings. The camera follows his movements, capturing both his hurried pace and the tense atmosphere around him. The scene conveys a strong sense of urgency and determination, set against a quiet, war-ravaged backdrop.
2. Main Object Caption: The central figure is a young man, likely in his late teens or early twenties, with short dark hair and a light complexion. He is dressed in a brown coat and carries a messenger bag across his body. His facial expression is focused and intense, eyes fixed straight ahead, and his mouth slightly open as though he is breathing heavily. His movements are rapid and deliberate, reinforcing the urgency of the situation.
3. Background Caption: The background consists of a narrow cobblestone street lined with historic European-style buildings. Sandbags and wooden crates are scattered along the street, reinforcing the wartime context. The sky is mostly clear with a few scattered clouds, suggesting fair weather. Throughout the video, the background remains unchanged, directing attention to the man's movement.
4. Camera Caption: The camera tracks the man from a consistent eye-level perspective, maintaining a medium distance that captures his full-body movement. As he runs through the street, the camera keeps pace with him, framing both his actions and the surrounding setting to convey a sense of immersion and tension.
5. Style Caption: The video adopts a realistic, documentary-style approach that emphasizes the tension and immediacy of the moment. The contrast between the man's dynamic movement and the still, deserted surroundings intensifies the feeling of a war-torn environment and personal urgency.
6. Action Caption: The man runs swiftly down the street, his footsteps quick and his coat and messenger bag fluttering as he moves. His pace is steady and purposeful, conveying urgency. The static background further highlights the motion of his figure.

(Hunyuan) A man is adjusting his cap and looking around occasionally. Surroundings include a suburban neighborhood with a brick house, a white van, and an American flag. The weather is sunny, with trees and a clear blue sky. The man seems slightly frustrated, talking to the camera with natural lighting.

1. Dense Caption: A man dressed in a white shirt and a greenish baseball cap is seen standing outdoors in a quiet suburban neighborhood. He occasionally adjusts his cap and neck while looking around his surroundings. The footage captures him from a close-up angle, highlighting his facial expressions and upper body. In the background, a white van, a brick house, and an American flag are visible, suggesting a bright and sunny day.
2. Main Object Caption: The main subject is a light-skinned man, likely in his late 20s or early 30s, with blond hair, a short beard, and blue eyes. He wears a teal and greenish baseball cap, reflective blue sunglasses, and a white long-sleeved shirt. He has a medium build and appears to be speaking directly to the camera. His expressions, such as a furrowed brow and pursed lips, convey a tone of slight frustration or annoyance, suggesting he may be explaining or commenting on something.
3. Background Caption: Set in a suburban neighborhood, the background features a white van parked near a brick house, with an American flag displayed prominently. The scene takes place on a sunny day under a clear blue sky, with trees casting partial shade across the area.
4. Camera Caption: The video is shot with a handheld camera, positioned at about eye level with the subject. It maintains a close-up framing, focusing on the man's head and shoulders. The camera pans gently, first left and then right in a repeating pattern, capturing the subject's profile while keeping him centered in the frame, creating an intimate, conversational perspective.
5. Style Caption: The video adopts a casual, candid visual tone with the use of natural lighting, lending an authentic, unfiltered atmosphere to the scene.
6. Action Caption: The man, dressed in white, is seen speaking while scratching his hand. He then turns his head to the right, maintaining subtle but expressive body language throughout.

(CogVideoX) A serene winter backyard with snow-covered ground and bare trees, revealing a blue shed with a white garage door and a doghouse.

1. Dense Caption: The video presents a tranquil backyard scene set during winter. It opens with a view of a wooden deck in the foreground before the camera gradually pans to reveal a small blue shed with a white garage door, accompanied by a doghouse to its left. In the center of the frame, a red slide stands out against the snow-covered ground. Surrounding the area are bare trees, reinforcing the cold and stillness of the season. The scene is calm and static, with no visible movement of people or animals, highlighting the peacefulness of a quiet winter day.
2. Main Object Caption: The primary objects in the scene include a blue shed with a white garage door, a doghouse positioned to its left, a red plastic slide, and a wooden deck. These elements are stationary, emphasizing the stillness of the wintry environment.
3. Background Caption: The backdrop features a snow-blanketed backyard enclosed by a white fence, with leafless trees and distant neighboring houses visible. The overcast sky and the untouched snow convey a cold, silent winter atmosphere. The background remains unchanged throughout the video, adding to the sense of stillness.
4. Camera Caption: Filmed from a high-angle perspective, the camera advances forward while gently panning to the right. It maintains a wide shot, encompassing both the house and the surrounding snowy ground, offering a comprehensive view of the serene setting.
5. Style Caption: The video adopts a realistic and unembellished style, authentically portraying the quietude of a winter day.
6. Action Caption: The camera moves forward slowly, capturing the snow-covered ground as well as the house in the distance, contributing to the gradual reveal of the scene.

(CogVideoX) A man gestures while the woman listens. They sit in a sunny park. The camera captures close-up shots of their heads and shoulders.

1. Dense Caption: Two individuals are seated outdoors on a sunny day, engaged in a relaxed conversation. The woman, dressed in a blue and white patterned outfit and wearing a wide-brimmed hat, sits on the left, while the man, in a dark shirt and sunglasses, is on the right. They appear to be in a tranquil outdoor environment, such as a park or garden, with green grass and trees surrounding them. As they talk, the man occasionally gestures, and the woman listens with interest.
2. Main Object Caption: The main subjects include a woman in her late 20s or early 30s with long blonde hair styled in a braid, light skin, and light-colored eyes. She wears a wide-brimmed brown hat, a white and blue patterned top, a necklace, and large hoop earrings. The man, also in his late 20s or early 30s, has light skin, short brown hair, a light beard, and wears a dark gray t-shirt along with dark sunglasses. The woman is seen smiling while speaking, as the man listens with a slight smile, indicating an engaging and friendly exchange.
3. Background Caption: The setting is a sunny, peaceful outdoor area with green grass, leafy trees, and a wooden structure resembling a gazebo or pavilion. The clear sky and natural lighting contribute to the pleasant atmosphere. The background remains still throughout the video, focusing attention on the interaction between the two people.
4. Camera Caption: The footage is slightly shaky, suggesting it was captured handheld. The camera is positioned at eye level with the subjects, maintaining a close-up view of their heads and shoulders. The woman is filmed from a front-side angle, positioned on the left side of the frame, while the man is shown in profile on the right. This framing keeps the conversation central and immersive.
5. Style Caption: The visual style is casual and candid, reflecting a spontaneous moment captured in a calm, outdoor setting. The atmosphere is natural and relaxed, emphasizing genuine human interaction.
6. Action Caption: The woman, wearing a brown hat, begins by talking toward the camera before turning to face the man seated to her right, who is dressed in a dark outfit. Her movement is subtle and conversational.
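The numbered format used in the structured captions above ("1. Dense Caption: ...", "2. Main Object Caption: ...") is regular enough to parse into named fields before handing it to a text-to-video model. The sketch below is one way to do that. The function names are hypothetical, and joining the fields in a fixed order is an assumption, since this page does not specify how the fields are serialized for the downstream generator.

```python
import re
from typing import Dict, Sequence

# Matches lines such as "1. Dense Caption: ..." or "6. Action Caption:The woman ...".
_FIELD_RE = re.compile(r"^\s*\d+\.\s*(?P<name>[A-Za-z ]+?)\s+Caption:\s*(?P<text>.*)$")

DEFAULT_ORDER = ("dense", "main_object", "background", "camera", "style", "action")


def parse_structured_caption(raw: str) -> Dict[str, str]:
    """Split a numbered structured caption into {field_name: text}."""
    fields: Dict[str, str] = {}
    current = None
    for line in raw.splitlines():
        match = _FIELD_RE.match(line)
        if match:
            # "Main Object" -> "main_object"
            current = match.group("name").strip().lower().replace(" ", "_")
            fields[current] = match.group("text").strip()
        elif current and line.strip():
            # Continuation line: append to the field being read.
            fields[current] += " " + line.strip()
    return fields


def to_prompt(fields: Dict[str, str], order: Sequence[str] = DEFAULT_ORDER) -> str:
    """Join the available fields into one prompt string (ordering is an assumption)."""
    return " ".join(fields[key] for key in order if fields.get(key))
```

For example, `to_prompt(parse_structured_caption(text))` yields a single string that could be passed to a text-to-video system; missing fields are skipped rather than treated as errors.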

• IDs to Video Generation

Given multiple identities and a short prompt, we compare the videos generated with the short prompt, with the short prompt plus a condition caption, and with the structured prompt.

Short Prompt | Controls | Generated Video w/ short prompt | Generated Video w/ short prompt + Condition Caption | Generated Video w/ structured prompt

A man in a dark suit stands outdoors, initially looking distressed. The camera captures a close-up of his face, showing a bandage on his forehead. Suddenly, he shifts his stance urgently, turns his head to the side, and raises a handgun with determination. The setting is plain, possibly showing a clear sky, and the lighting suggests it's daytime.

A young girl wearing a school uniform and a young man in casual clothes are walking side by side along a dimly lit concrete wall at night. The girl walks on the left while the boy rides a bicycle on the right. The background is urban and gritty, with warm, moody lighting. The camera follows them closely, capturing a medium close-up shot of their upper bodies from different angles as they move. The scene has a nostalgic and contemplative atmosphere.

A woman in a skirt dances on the snow with her hair flying.

A whale is floating in the girl's palm, and the camera gradually zooms in.

• Camera to Video Generation

Given a camera trajectory and a short prompt, we compare the videos generated with the short prompt and with the structured prompt.

Short Prompt | Controls | Generated Video w/ short prompt | Generated Video w/ structured prompt

A serene video of a large house with a red roof and a spacious porch, surrounded by lush greenery. A peaceful countryside setting with vibrant colors and a tranquil atmosphere.

A well-lit dining and living room with elegant and classic decor. The dining table is surrounded by chairs and has a chandelier above it. There's a wooden cabinet against the wall. The background features a hallway with a staircase and another dining area visible. The decor includes wooden furniture and framed pictures on the walls.

The scene is bathed in bright sunlight, emphasizing the warm and inviting atmosphere. A modern house with large windows and a balcony is showcased. Potted plants accent the architectural details. Distant mountains frame the view. Lush greenery surrounds the scene. The sky is a clear blue, dotted with scattered clouds. A dreamy lens flare effect adds to the serene quality. The overall ambiance is tranquil and picturesque.

• IDs+Depth to Video Generation

Given a depth sequence, multiple identities, and a short prompt, we compare the videos generated with the short prompt and with the structured prompt.

Short Prompt | Controls | Generated Video w/ short prompt | Generated Video w/ structured prompt

Young woman with cat ears holding a white cat in a sunlit meadow. Blue butterflies flutter around. Gentle caresses and a serene, magical atmosphere. Trees cast a warm glow. Fixed camera at eye level capturing upper body. Whimsical, enchanting style.

A young girl dances in a bright, colorful room. A fluffy dog joyfully mimics her moves. The room features unique, colorful furniture with large windows letting in natural light. The dance is lively and expressive. The atmosphere is cheerful and vibrant. The camera captures full body movements at eye level.

A man in a blue sweater and dark jacket looks distressed and scared. He is outdoors in a residential area with trees and buildings in the background. The camera pans smoothly, focusing on his upper body and face from various angles during daylight. The scene captures his intense emotions and movements realistically.

A man in a dark, workshop-like room, explaining something passionately. Various tools and shoes are on the shelves and workbench. The lighting is dim, creating a serious atmosphere. The camera pans right and then moves forward, focusing on him from a medium close-up shot.

A ballerina in a black dress with a feathered bodice and a male dancer in a dark, patterned outfit. Detailed stone wall backdrop with ornate metalwork, dim lighting. Emotional and synchronized dance with spins, lifts, arm movements. Dramatic and gothic atmosphere.

• Style to Video Generation

Given a style image and a short prompt, we compare the videos generated with the short prompt and with the structured prompt.

Short Prompt | Controls | Generated Video w/ short prompt | Generated Video w/ structured prompt

A lighthouse is beaming across choppy waters.

A little girl is reading a book in the beautiful garden.

A street performer playing the guitar.

BibTeX

@inproceedings{wu2025Any2Caption,
    title={Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation},
    author={Shengqiong Wu and Weicai Ye and Jiahao Wang and Quande Liu and Xintao Wang and Pengfei Wan and Di Zhang and Kun Gai and Shuicheng Yan and Hao Fei and Tat-Seng Chua},
    booktitle={arxiv},
    year={2025}
}