The recently emerged 4D Panoptic Scene Graph (4D-PSG) offers an advanced representation for comprehensively modeling the dynamic 4D visual world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity and the resulting out-of-vocabulary problems; moreover, the pipeline nature of the benchmark generation method leads to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end 4D-PSG generation. A chained SG inference mechanism is further designed to exploit the LLM's open-vocabulary capabilities to iteratively infer accurate and comprehensive object and relation labels. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, in which a spatial-temporal scene transcending strategy transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for data scarcity in 4D-PSG.
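To make the chained SG inference idea concrete, the sketch below shows one way such iterative querying could look in Python: the LLM is first asked for open-vocabulary object labels and then, conditioned on each object pair, for relation labels. The llm.generate interface and all helper names are hypothetical placeholders for illustration, not the paper's implementation.

# Conceptual sketch of chained SG inference (hypothetical llm.generate API).
def chained_sg_inference(llm, scene_tokens, max_objects=10):
    # Pass 1: infer open-vocabulary object labels from the encoded 4D scene.
    objects = llm.generate(
        prompt="List the objects visible in this scene.",
        scene=scene_tokens,
    )[:max_objects]

    # Pass 2: infer a relation label for each ordered object pair.
    triplets = []
    for subj in objects:
        for obj in objects:
            if subj == obj:
                continue
            relation = llm.generate(
                prompt=f"What is the relation between '{subj}' and '{obj}'?",
                scene=scene_tokens,
            )
            if relation:  # keep only pairs with a predicted relation
                triplets.append((subj, relation, obj))
    return triplets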
Figure 1: (a) Illustration of 4D-PSG, (b) SG dataset statistics, and (c) motivation for 2D scene transfer learning.
The framework is shown in Fig. 2 (cf. step 1). Specifically, given dual inputs of RGB and depth images of the 4D scene, we use ImageBind as the 4D scene encoder, encoding each modality separately, followed by an aggregator that efficiently fuses features from all modalities. Next, since our scene understanding focuses primarily on object-level and relation-level comprehension, and we seek to optimize inference efficiency, we merge the resulting representations spatially and temporally. The merged features are then passed through an MLP projector, which transforms the embeddings into the language space for LLM comprehension. We instantiate the LLM with LLaMA2 and leverage it to output textual relation triplet sequences for SG generation. We also introduce a signal token ``[Obj]'' to trigger object segmentation, so the output sequence takes the form ``oi [Obj] rk oj [Obj] ts te''. At the backend, we employ SAM2 as the 3D mask decoder, which takes both the hidden states of the ``[Obj]'' tokens and the original RGB frames as input. To ensure compatibility with SAM2, a linear projector first maps the hidden states to the dimension of SAM2's prompt embedding; the projected hidden states are then used as prompt embeddings for SAM2.
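The following PyTorch sketch illustrates how these components could be wired together end to end. The module interfaces (ImageBind-style scene encoder, aggregator, a LLaMA2 wrapper with a hypothetical generate_with_hidden method, and a SAM2 decoder accepting prompt embeddings) are assumptions made for illustration and do not reflect the released code.

import torch
import torch.nn as nn

class FourDPSGGenerator(nn.Module):
    def __init__(self, scene_encoder, aggregator, llm, sam2_decoder,
                 d_feat, d_llm, d_prompt, obj_token_id):
        super().__init__()
        self.scene_encoder = scene_encoder   # ImageBind, applied per modality
        self.aggregator = aggregator         # fuses RGB and depth features
        self.mlp_proj = nn.Sequential(       # MLP projector into language space
            nn.Linear(d_feat, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))
        self.llm = llm                       # LLaMA2 backbone (assumed wrapper)
        self.obj_proj = nn.Linear(d_llm, d_prompt)  # linear projector for SAM2
        self.sam2_decoder = sam2_decoder     # 3D mask decoder
        self.obj_token_id = obj_token_id     # id of the ``[Obj]'' signal token

    def forward(self, rgb, depth):
        # 1) Encode RGB and depth separately, then aggregate the two modalities.
        feats = self.aggregator(self.scene_encoder(rgb),
                                self.scene_encoder(depth))    # [B, T, N, d_feat]

        # 2) Merge spatially and temporally (mean pooling stands in for the
        #    paper's merging strategy), then project into the LLM space.
        scene_tokens = self.mlp_proj(feats.mean(dim=(1, 2)))  # [B, d_llm]

        # 3) The LLM emits the sequence ``oi [Obj] rk oj [Obj] ts te''.
        out_ids, hidden = self.llm.generate_with_hidden(scene_tokens)

        # 4) Hidden states at [Obj] positions become SAM2 prompt embeddings.
        obj_hidden = hidden[out_ids == self.obj_token_id]     # [K, d_llm]
        prompts = self.obj_proj(obj_hidden)                   # [K, d_prompt]
        masks = self.sam2_decoder(rgb, prompt_embeddings=prompts)
        return out_ids, masks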
Figure 2: Overview of the 2D-to-4D visual scene transfer learning mechanism for 4D-PSG generation, comprising four key steps.
Figure 4: A case illustrating a 4D-PSG prediction produced by the 4D-LLM.
@inproceedings{wu2025psg4dllm,
title={Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene},
author={Shengqiong Wu and Hao Fei and Jingkang Yang and Xiangtai Li and Juncheng Li and Hanwang Zhang and Tat-Seng Chua},
booktitle={CVPR},
year={2025}
}