Unified Vision–Language Models (UVLMs) aim to advance multimodal learning by supporting both understanding and generation within a single framework. However, existing approaches largely focus on architectural unification while overlooking the need for explicit interaction between the two capabilities during task solving. As a result, current models treat understanding and generation as parallel skills rather than synergistic processes. To achieve real synergy, we introduce the interleaved Analyzing–Drafting problem-solving loop (AD-Loop), a new thinking paradigm that dynamically alternates between analytic and drafting operations. By interleaving textual thoughts with visual thoughts, AD-Loop enables models to iteratively refine both comprehension and outputs, fostering genuine synergy. To train this mechanism, we design a two-stage strategy: supervised learning on interleaved thought data to initialize the alternation, followed by reinforcement learning to promote adaptive and autonomous control. Extensive experiments demonstrate that AD-Loop consistently improves performance across standard benchmarks for both understanding and generation, and transfers well to a variety of UVLM architectures. Visual analyses further validate the effectiveness of implicit visual thoughts. These results highlight AD-Loop as a principled and broadly applicable strategy for synergizing comprehension and creation.
Existing explorations of synergy have focused predominantly on architectural designs that unify the two abilities, while overlooking a crucial point: during task solving, the understanding and generation modules rarely interact closely and explicitly, so genuine synergy between comprehension and generation is never realized.
Figure 1: Comparison of existing mechanisms for synergizing understanding and generation, including isolated learning where the two abilities are trained independently, dual learning which leverages cross-modal reconstruction for mutual supervision, and co-learning which jointly optimizes both tasks with paired samples.
To illustrate, when a user’s instruction is ambiguous, the understanding module could first propose several plausible candidate solutions, then invoke generation to produce sketches or key visualizations that “verify” these candidates, ultimately yielding the correct answer. Conversely, once the generation module produces an initial result, it could query the understanding module for high-level guidance, such as object attributes or plausible spatial layouts, to progressively refine the output (see Figure 2). This motivates a new perspective: instead of treating understanding and generation as co-existing skills, we argue they should be interleaved within a problem-solving loop.
Figure 2: Overall illustration of the multimodal encoding framework.
Building upon a UVLM, we model the synergistic understanding–generation thinking process as an interleaved analyzing–drafting problem-solving loop. Given an input, the model alternates between analyzing (producing textual thoughts) and drafting (producing visual thoughts) before delivering the final output. To achieve this, we design a two-stage training pipeline: Stage 1 performs supervised training to imitate interleaved thinking, and Stage 2 leverages reinforcement learning to enable the model to adaptively decide when to invoke analysis or drafting.
Figure 3: Illustration of the interleaved analyzing–drafting problem-solving loop, where understanding and generation interact synergistically to yield accurate solutions.
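The alternation described above can be sketched as a simple control loop. All model calls below are stubbed placeholders; names such as `analyze`, `draft`, and `should_stop` are illustrative assumptions, not the paper's actual API, and the fixed even/odd schedule stands in for the policy that Stage 2 learns to make adaptive.

```python
from dataclasses import dataclass, field

@dataclass
class ADLoopState:
    # Alternating sequence of ("text", ...) and ("visual", ...) thoughts
    thoughts: list = field(default_factory=list)

def analyze(state):
    """Analyzing step: produce a textual thought (stubbed)."""
    return ("text", f"analysis-{len(state.thoughts)}")

def draft(state):
    """Drafting step: produce a visual thought (stubbed)."""
    return ("visual", f"draft-{len(state.thoughts)}")

def should_stop(state, max_steps=6):
    """Stop criterion; the trained model would decide this adaptively."""
    return len(state.thoughts) >= max_steps

def ad_loop(max_steps=6):
    state = ADLoopState()
    while not should_stop(state, max_steps):
        # Strict alternation here; the RL-trained policy instead chooses
        # when each operation is actually needed.
        step = analyze if len(state.thoughts) % 2 == 0 else draft
        state.thoughts.append(step(state))
    return state.thoughts

trace = ad_loop()
```

In this sketch the loop terminates on a step budget; in the trained model, termination and the analyze/draft choice are both policy decisions.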
As shown in Figure 4, the observations align well with our expectations: latent visual thoughts encode semantically coherent information while preserving coarse pixel-level structures. This allows the model to recover the overall contours of the original image and to unify conceptually similar regions. For example, in case (4), distinct regions depicting watermelons and lemons are consistently represented by the same latent token, reflecting their shared conceptual category.
Figure 4: Examples of latent visual thoughts. Each case shows the original image (left) and the corresponding visual thoughts (right), capturing abstract visual structures.
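One simple way to picture how a latent visual thought can map conceptually similar regions (e.g., the watermelon and lemon regions in case (4)) to a shared token is nearest-neighbor vector quantization: each patch embedding snaps to its closest codebook entry, so nearby embeddings share one latent index. The codebook and patch features below are toy values for illustration, not the model's actual representation.

```python
import numpy as np

def quantize(patches, codebook):
    """Map each patch embedding to the index of its nearest code vector."""
    # Pairwise distances via broadcasting: (n_patches, n_codes)
    d = np.linalg.norm(patches[:, None, :] - codebook[None, :, :], axis=-1)
    return d.argmin(axis=1)

# Toy 2-D codebook with two latent tokens
codebook = np.array([[1.0, 0.0],
                     [0.0, 1.0]])
# Two similar "fruit-flesh" patches and one distinct patch (toy features)
patches = np.array([[0.9, 0.1],
                    [0.8, 0.2],
                    [0.1, 0.9]])
tokens = quantize(patches, codebook)
```

The first two patches collapse to the same latent token, mirroring how semantically similar regions are represented by one visual-thought token.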
As shown in Figure 5, integrating AD-Loop thoughts improves performance across a wide range of questions, with pronounced gains in spatial and mechanistic reasoning. Fine-grained trends show preferential activation for rotation, complex OCR, and 3D perception, while usage drops for tables, sequences, and symbolic reasoning, where text-only chains suffice. These patterns indicate that our adaptive policy selectively invokes visual thoughts where they offer the greatest benefit.
Figure 5: Performance across skills and capabilities on the LogicVista dataset, comparing models with and without visual thoughts, alongside the proportion of visual-thought usage.
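The selective-invocation behavior above can be caricatured as a category gate. The category names and the rule itself are illustrative assumptions distilled from the Figure 5 trends, not the learned policy, which conditions on the full input rather than a label.

```python
# Categories where Figure 5 shows high visual-thought usage (assumed labels)
HIGH_BENEFIT = {"rotation", "complex_ocr", "3d_perception", "spatial"}
# Categories where text-only chains suffice (assumed labels)
TEXT_SUFFICES = {"tables", "sequences", "symbolic"}

def invoke_visual_thought(category: str, default: bool = False) -> bool:
    """Toy gate: draft a visual thought only where it tends to help."""
    if category in HIGH_BENEFIT:
        return True
    if category in TEXT_SUFFICES:
        return False
    return default  # fall back for unlisted categories
```

In practice this decision is made token-by-token by the RL-trained model; the gate merely summarizes the aggregate usage pattern.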
As shown in Figure 6, with the raw prompt alone, the model tends to generate superficial, semantically shallow outputs. Adding self-think produces more detailed descriptions, yet the results remain overly abstract and often misaligned with user intent. By contrast, interleaved thoughts guide faithful, detail-oriented outputs (e.g., correct wheels and screens). Finally, filtering interleaved traces down to text only frequently reintroduces errors (e.g., in lighting and positioning), underscoring the necessity of visual thoughts for high-fidelity controllability.
Figure 6: Qualitative comparison: original prompt (left), self-think mode, interleaved thoughts, and text-only thoughts filtered from the interleaved thoughts (right). [V-T] denotes latent visual thoughts.
@article{wu2026synergizing,
title={Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking},
author={Shengqiong Wu and Bobo Li and Xinkai Wang and Xiangtai Li and Lei Cui and Furu Wei and Shuicheng Yan and Hao Fei and Tat-Seng Chua},
journal={arXiv preprint arXiv:2602.21435},
year={2026}
}