About

I am a Postdoctoral Researcher in the Department of Computer Science, University of Oxford, working with Prof. Michael Wooldridge. I obtained my Ph.D. at the NExT++ Research Center, advised by Prof. Tat-Seng Chua in the School of Computing, National University of Singapore. I received my M.S. and B.S. degrees from Wuhan University.

My research works toward general multimodal intelligence, currently along the following directions:

I am always happy to discuss potential collaborations — feel free to drop me an email.

News

Representative Work

NExT-GPT — the first unified any-to-any multimodal LLM, able to understand and generate across any modality or combination of modalities (text, image, video, audio).
[PDF] [Code] [HF] [Video]
ICML'24 Oral · Most Influential Paper (Paper Digest) · WAIC Youth Outstanding Paper Award · 1150+ citations
Any2Caption — a SoTA framework for controllable video generation from any condition, the first to leverage MLLMs to interpret diverse inputs into dense, structured captions.
[PDF] [Project] [HF] [Video]
Preprint, 2025
SeTok — the first general dynamic semantic-equivalent vision tokenizer, addressing a core performance bottleneck of existing MLLMs.
[PDF] [Code]
ICLR'25
USG — the first Universal Scene Graph representation framework, unifying structured semantic scene graphs across images, text, videos, and 3D.
[PDF] [Code]
CVPR'25 Highlight