Shengqiong Wu

About Me

I am currently a Postdoctoral Researcher in the Department of Computer Science at the University of Oxford, working with Prof. Michael Wooldridge. Prior to this, I obtained my Ph.D. degree at the NExT++ Research Center, advised by Prof. Tat-Seng Chua in the School of Computing, National University of Singapore. I received both my M.S. and B.S. degrees from Wuhan University.

My research focuses on Large Vision-Language Foundation Models, particularly their capability, controllability, evaluability, and robust, reliable reasoning. I am also actively working on multi-agent systems and their integration with foundation models. More broadly, I am interested in Natural Language Processing and intelligent agent research.

I am always happy to discuss potential collaborations — please feel free to drop me an email.

Some of my representative work:

	NExT-GPT: The first unified any-to-any multimodal LLM, capable of understanding and generating across any modality or combination of modalities (e.g., text, image, video, audio). [PDF] [Github] [Huggingface] [Video] (ICML'24 Oral, selected as a Most Influential Paper by Paper Digest, WAIC Youth Outstanding Paper Award)
	Any2Caption: A SoTA framework for controllable video generation from any condition by being the first to leverage MLLMs to interpret diverse inputs into dense, structured captions. [PDF] [Github] [Huggingface] [Video] (Preprint, 2025)
	Setok: The first to propose a general dynamic semantic-equivalent vision tokenizer, fundamentally addressing the performance bottlenecks of existing MLLMs. [PDF] [Github] (ICLR'25)
	USG: The first to propose a Universal Scene Graph representation framework that unifies structured semantic scene graphs across modalities including images, text, videos, and 3D. [PDF] [Github] (CVPR'25 Highlight)

🔥NEWs🔥

[NEW!-2026/05]🚀 I am co-organizing the Joint Audio-Video CG Workshop @ ACM MM 2026. Submissions are welcome!

[NEW!-2026/05]🚀 We are excited to release what we believe is the first comprehensive survey on Audio-Visual Intelligence (AVI) in the era of large foundation models!!

[NEW!-2026/01]🚀 I'm co-organizing the Workshop on ANY-TO-ANY MULTIMODAL LEARNING at CVPR 2026. Submissions and presentations are welcome!

[NEW!-2026/01]🚀 I maintain a GitHub repo focusing on ANY-TO-ANY MULTIMODAL GENERATION. Feel free to check it out, contribute, and share any resources or insights.

[NEW!-2025/11]🚀 We develop a Universal Video Agent towards Open-Source Next-Generation Video Generalist.

[NEW!-2025/10]🥳🥳🥳 I'm co-organizing the Workshop on Scene Graph on Structured Intelligence at WACV 2026. Submissions and presentations are welcome!🤓

[NEW!-2025/06] 🚀 Just launched an open discussion on the future of scene graphs and scene understanding. Let us know what you think in the discussion thread!

[NEW!-2025/06] 📣 We are hosting a Grand Challenge at the MUCG Workshop, and you are welcome to participate. ✨ Top teams will be awarded certificates + cash prizes!

[NEW!-2025/05] 🚀📣 Excited to announce that we're organizing the 1st MUCG Workshop: MLLM for Unified Comprehension and Generation at ACM MM 2025!

[NEW!-2025/05] 🥳 ICML'25 Spotlight paper: Path to Multimodal Generalist: General-Level and General-Bench. A new evaluation paradigm for multimodal generalists.

[NEW!-2025/03] We have released the first survey on MM-CoT reasoning. You are welcome to participate and star it ✨✨.

[NEW!-2025/03] 🐾 I will be a volunteer at ICLR 2025. One paper (Setok) has been accepted by ICLR-25. Looking forward to meeting you.

[NEW!-2025/02] 🥳🥳🥳 Two of my full papers have been accepted to CVPR 2025: Universal Scene Graph Generation and 4D Scene Graph Generation.

[NEW!-2025/02] Welcome to SSNLP-25, Free Registration!!!

[NEW!-2024/12] 🥳 I maintain a GitHub repo focusing on Awesome-Scene-Graph-Generation-and-Application. Feel free to check it out, contribute, and share any resources or insights related to scene graph generation!

[NEW!-2024/12] One paper focusing on mitigating hallucination in MLLMs is accepted by AAAI-25!

[NEW!-2024/08] 🐾I'm excited to have the opportunity to be a volunteer at ACL 2024. Looking forward to being part of it!

[NEW!-2024/08] I have tried to build NExT-GPT in PaddlePaddle, PaddleNLP and PPDiffusers. Still working on it. 🤔🤔

[NEW!-2024/05] Congratulations! 🥳🥳🥳 Two full papers have been accepted to ICML-24: NExT-GPT and Video-of-Thought.

[NEW!-2024/04] We release Vitron (Demo, Paper, Code), a universal pixel-level vision LLM designed for understanding, generating, segmenting, and editing both images and videos. 🌟🌟

[NEW!-2024/02] One full paper is accepted by CVPR-24 about Text-to-Video Generation. Congrats to all my co-authors.

[NEW!-2024/01] I am honored to receive the Baidu Scholarship (10 recipients worldwide). 🥳🥳🥳

[NEW!-2023/12] I have successfully passed my research-based QE and am now a Ph.D. candidate. 🥳🥳🥳

[NEW!-2023/11] I'll join Kunlun 2050 Research as a research intern, advised by Prof. Yan. 😆

[NEW!-2023/10] We build NExT-GPT, a general-purpose any-to-any MLLM. 🌟

[NEW!-2023/09] One full paper is accepted in NeurIPS-23, about Intricate Text-to-image Generation based on Scene Graph.

[NEW!-2023/07] One full paper is accepted in ACM MM-23, about High-faithful Text-to-image Generation enhanced with layout planning from LLM.

[NEW!-2023/05] Two full papers are accepted in ACL-23, about Multimodal Relation Extraction and Image Captioning.

[NEW!-2022/07] I'm heading to SoC NUS to pursue my PhD—a new journey and fresh challenges await! 🙃

Publications

2026

You Qin, Kai Liu, Shengqiong Wu, Kai Wang, Shijian Deng, Yapeng Tian, Junbin Xiao, Yazhou Xing, Yinghao Ma, Bobo Li, Roger Zimmermann, Lei Cui, Furu Wei, Jiebo Luo, Hao Fei. Audio-Visual Intelligence in Large Foundation Models: A Comprehensive Survey. arXiv. 2026. [pdf][Paper List]
Shengqiong Wu, Bobo Li, Xinkai Wang, Xiangtai Li, Lei Cui, Furu Wei, Shuicheng Yan, Hao Fei, Tat-Seng Chua. Synergizing Understanding and Generation with Interleaved Analyzing-Drafting Thinking. ICLR. 2026. [pdf]
Jundong Xu, Hao Fei, Huichi Zhou, Xin Quan, Qijun Huang, Shengqiong Wu, William Yang Wang, Mong-Li Lee, Wynne Hsu. LogicReward: Incentivizing LLM Reasoning via Step-Wise Logical Supervision. ICLR. 2026. [Project][pdf]
Kai Liu, Wei Li, Lai Chen, Shengqiong Wu, Yanhao Zheng, Jiayi Ji, Fan Zhou, Jiebo Luo, Ziwei Liu, Hao Fei, Tat-Seng Chua. JavisDiT: Joint Audio-Video Diffusion Transformer with Hierarchical Spatio-Temporal Prior Synchronization. ICLR. 2026. [Project][pdf]
Kai Liu, Yanhao Zheng, Kai Wang, Shengqiong Wu, Rongjunchen Zhang, Jiebo Luo, Dimitrios Hatzinakos, Ziwei Liu, Hao Fei, Tat-Seng Chua. JavisDiT++: Unified Modeling and Optimization for Joint Audio-Video Generation. ICLR. 2026. [Project][pdf]

2025

Shengqiong Wu, Weicai Ye, Yuanxing Zhang, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Kun Gai, Hao Fei, Tat-Seng Chua. A Reason-then-Describe Instruction Interpreter for Controllable Video Generation. arXiv. 2025. [Project][pdf]
Zhengyang Liang, Daoan Zhang, Huichi Zhou, Rui Huang, Bobo Li, Yuechen Zhang, Shengqiong Wu, Xiaohan Wang, Jiebo Luo, Lizi Liao, Hao Fei. UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist. arXiv. 2025. [Project][pdf]
Shengqiong Wu, Weicai Ye, Jiahao Wang, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Kun Gai, Shuicheng Yan, Hao Fei, Tat-Seng Chua. Any2Caption: Interpreting Any Condition to Caption for Controllable Video Generation. arXiv. 2025. [Project][pdf]
Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, et al. On Path to Multimodal Generalist: General-Level and General-Bench. ICML. 2025. [Project][pdf][Huggingface]
Yaoting Wang, Shengqiong Wu, Yuechen Zhang, William Wang, Ziwei Liu, Jiebo Luo, Hao Fei. Multimodal Chain-of-Thought Reasoning: A Comprehensive Survey. arXiv. 2025. [Code][pdf]
Shengqiong Wu, Hao Fei, Tat-Seng Chua, Shuicheng Yan. Universal Scene Graph Generation. CVPR. 2025. [Code][pdf]
Shengqiong Wu, Hao Fei, Jingkang Yang, Xiangtai Li, Juncheng Li, Hanwang Zhang, Tat-Seng Chua. Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene. CVPR. 2025. [Code][pdf]
Shengqiong Wu, Hao Fei, Xiangtai Li, Jiayi Ji, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan. Towards Semantic Equivalence of Tokenization in Multimodal LLM. ICLR. 2025. [Code][pdf]
Shengqiong Wu, Hao Fei, Liangming Pan, William Yang Wang, Shuicheng Yan, Tat-Seng Chua. Combating Multimodal LLM Hallucination via Bottom-up Holistic Reasoning. In Proceedings of AAAI. 2025. [pdf]

2024

Hao Fei, Shengqiong Wu, Hanwang Zhang, Tat-Seng Chua, Shuicheng Yan. VITRON: A Unified Pixel-level Vision LLM for Understanding, Generating, Segmenting, Editing. In Proceedings of NeurIPS. 2024. [Code][pdf]
Meng Luo, Hao Fei*, Bobo Li, Shengqiong Wu, Qian Liu, Soujanya Poria, Erik Cambria, Mong-Li Lee, Wynne Hsu. PanoSent: A Panoptic Sextuple Extraction Benchmark for Multimodal Conversational Aspect-based Sentiment Analysis. In Proceedings of ACM MM. 2024. (Oral) [Code][pdf]
Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, Tat-Seng Chua. NExT-GPT: Any-to-Any Multimodal Large Language Model. In Proceedings of ICML. 2024. (Oral) [Code | 3.6k 🌟][pdf]
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Meishan Zhang, Mong-Li Lee, Wynne Hsu. Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition. In Proceedings of ICML. 2024. (Oral) [Code][pdf]
Hao Fei, Shengqiong Wu, Wei Ji, Hanwang Zhang, Tat-Seng Chua. Dysen-VDM: Empowering Dynamics-aware Text-to-Video Diffusion with LLMs. In Proceedings of CVPR. 2024. [Code][pdf]

2023

Shengqiong Wu, Hao Fei, Hanwang Zhang, Tat-Seng Chua. Imagine That! Abstract-to-Intricate Text-to-Image Synthesis with Scene Graph Hallucination Diffusion. In Proceedings of NeurIPS. 2023. (long, poster) [Code][pdf]
Leigang Qu*, Shengqiong Wu*, Hao Fei, Liqiang Nie, Tat-Seng Chua. LayoutLLM-T2I: Eliciting Layout Guidance from LLM for Text-to-Image Generation. In Proceedings of ACM MM. 2023. (*: equal contribution, long) [Code][pdf]
Bobo Li, Hao Fei, Yuhan Wu, Jinsong Zhang, Shengqiong Wu, Jingye Li, Yijiang Liu, Lizi Liao, Tat-Seng Chua, Fei Li, Donghong Ji. DiaASQ: A benchmark of conversational aspect-based sentiment quadruple analysis. In Proceedings of ACL. 2023. (long, poster) [Code][pdf]
Shengqiong Wu, Hao Fei, Yixin Cao, Lidong Bing, Tat-Seng Chua. Information Screening whilst Exploiting! Multimodal Relation Extraction with Feature Denoising and Multimodal Topic Modeling. In Proceedings of ACL. 2023. (long, poster, paper award nomination, 1.6%) [Code][pdf]
Shengqiong Wu, Hao Fei, Wei Ji, Tat-Seng Chua. Cross2StrA: Unpaired Cross-lingual Image Captioning with Cross-lingual Cross-modal Structure-pivoted Alignment. In Proceedings of ACL. 2023. (long, oral) [pdf]

2022

Hao Fei, Shengqiong Wu, Jingye Li, Bobo Li, Fei Li, Libo Qin, Meishan Zhang, Min Zhang, Tat-Seng Chua. LasUIE: Unifying information extraction with latent adaptive structure-aware generative language model. In Proceedings of NeurIPS. 2022. (long, poster) [Code][pdf]
Hu Cao, Jingye Li, Fangfang Su, Fei Li, Hao Fei, Shengqiong Wu, Bobo Li, Liang Zhao and Donghong Ji. OneEE: A One-Stage Framework for Fast Overlapping and Nested Event Extraction. In Proceedings of COLING. 2022. (long, oral) [Code][pdf]
Shengqiong Wu, Hao Fei, Fei Li, Meishan Zhang, Yijiang Liu, Chong Teng, Donghong Ji. Mastering the Explicit Opinion-Role Interaction: Syntax-Aided Neural Transition System for Unified Opinion Role Labeling. In Proceedings of AAAI. 2022. (long, online) [Code][pdf]
Jingye Li, Hao Fei, Jiang Liu, Shengqiong Wu, Meishan Zhang, Chong Teng, Donghong Ji, Fei Li. Unified named entity recognition as word-word relation classification. In Proceedings of AAAI. 2022. (long, online) [Code][pdf]

2021

Shengqiong Wu, Hao Fei, Yafeng Ren, Donghong Ji, Jingye Li. Learn from Syntax: Improving Pair-wise Aspect and Opinion Terms Extraction with Rich Syntactic Knowledge. In Proceedings of IJCAI. 2021. (long, online) [Code][pdf]

Shengqiong WU (Tori)

Academic Services

Conference Reviewer

Journal Reviewer

Invited Talks

Internships

Selected Honors & Awards

Skills and MISC

Kuaishou Kling Research Advisor: Weicai Ye, Xintao Wang
Kunlun Skywork AI, Singapore 2050 Research Advisor: Shuicheng Yan, Director