
Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene

1National University of Singapore   2Nanyang Technological University   3Zhejiang University

Abstract

The recently emerged 4D Panoptic Scene Graph (4D-PSG) provides an advanced representation for comprehensively modeling the dynamic 4D visual world. Unfortunately, current pioneering 4D-PSG research suffers severely from data scarcity and the resulting out-of-vocabulary problems; moreover, the pipeline nature of the benchmark generation method leads to suboptimal performance. To address these challenges, this paper investigates a novel framework for 4D-PSG generation that leverages rich 2D visual scene annotations to enhance 4D scene learning. First, we introduce a 4D Large Language Model (4D-LLM) integrated with a 3D mask decoder for end-to-end 4D-PSG generation. A chained SG inference mechanism is further designed to exploit LLMs' open-vocabulary capabilities to iteratively infer accurate and comprehensive object and relation labels. Most importantly, we propose a 2D-to-4D visual scene transfer learning framework, where a spatial-temporal scene transcending strategy transfers dimension-invariant features from abundant 2D SG annotations to 4D scenes, effectively compensating for the data scarcity of 4D-PSG.

Teaser

Figure 1: (a) Illustration of 4D-PSG, (b) SG dataset statistics, and (c) motivation for 2D scene transfer learning.

Model Architecture

The framework is shown in Fig. 2 (Step 1). Specifically, given dual inputs of RGB and depth images of the 4D scene, we use ImageBind as the 4D scene encoder for each modality separately, followed by an aggregator that efficiently fuses the features from both modalities. Next, since our scene understanding focuses primarily on object-level and relation-level comprehension and we seek to optimize inference efficiency, we merge the resulting representations spatially and temporally. The merged features are then passed through an MLP projector, which transforms the embeddings into the language space for LLM comprehension. We instantiate the LLM with LLaMA2 and leverage it to output textual relation triplet sequences for SG generation. We also introduce a signal token "[Obj]" to trigger object segmentation, so the output sequence takes the form "o_i [Obj] r_k o_j [Obj] t_s t_e". At the backend, we employ SAM2 as a 3D mask decoder, which takes both the hidden states of the "[Obj]" tokens and the original RGB image frames as input. To ensure compatibility with SAM2, a linear projector first maps the hidden states to the dimension of SAM2's prompt embeddings, and the projected hidden states are then used as prompt embeddings for SAM2.
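To make the data flow above concrete, below is a minimal PyTorch-style sketch of the forward path: RGB and depth frames are encoded per modality, fused by an aggregator, merged spatio-temporally, projected into the LLM embedding space, and the LLM hidden states at "[Obj]" positions are projected to serve as SAM2 prompt embeddings. The tiny encoder stand-ins, hidden sizes, pooling choice, and helper names are illustrative assumptions; ImageBind, LLaMA2, and SAM2 are not actually invoked here.

import torch
import torch.nn as nn

class Scene4DEncoder(nn.Module):
    """Encodes RGB and depth separately, fuses them, merges tokens, and projects to the LLM space."""
    def __init__(self, feat_dim=1024, llm_dim=4096):
        super().__init__()
        # Tiny stand-ins for the per-modality ImageBind encoders (assumption, not the real encoders).
        self.rgb_encoder = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(3 * 8 * 8, feat_dim))
        self.depth_encoder = nn.Sequential(nn.AdaptiveAvgPool2d(8), nn.Flatten(), nn.Linear(1 * 8 * 8, feat_dim))
        self.aggregator = nn.Linear(2 * feat_dim, feat_dim)        # fuses the two modalities
        self.projector = nn.Sequential(                            # MLP projector into the LLM embedding space
            nn.Linear(feat_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim))

    def forward(self, rgb, depth):
        # rgb: (T, 3, H, W) frames; depth: (T, 1, H, W) depth maps
        fused = self.aggregator(torch.cat([self.rgb_encoder(rgb), self.depth_encoder(depth)], dim=-1))
        merged = fused.mean(dim=0, keepdim=True)   # spatial-temporal token merging (here simply mean pooling)
        return self.projector(merged)              # (1, llm_dim) visual tokens fed to the LLM

def obj_prompt_embeddings(hidden_states, token_ids, obj_token_id, prompt_proj):
    """Gather LLM hidden states at '[Obj]' positions and project them to SAM2's prompt dimension."""
    # hidden_states: (L, D) LLM outputs; token_ids: (L,) generated token ids; prompt_proj: linear projector.
    return prompt_proj(hidden_states[token_ids == obj_token_id])  # (num_obj, sam2_prompt_dim)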

framework

Figure 2: Overview of 2D-to-4D visual scene transfer learning mechanisms for 4D-PSG generation, including 4 key steps.

2D-to-4D Visual Scene Transfer Learning

  • Step 1: 4D Scene Perception Initiation Learning. We begin by performing initiation learning so that the LLM develops a foundational perception of the 4D scene and can generate 4D SGs.
  • Step 2: 2D-to-4D Scene Transcending Learning. As shown in Fig. 2a, this step consists of three subprocesses (a minimal sketch of these estimators follows this list):
    • Subprocess a): 2D (RGB)-to-Depth Transcending Learning, which optimizes the depth estimator F_de to predict depth features;
    • Subprocess b): 2D (RGB) Temporal Transcending Learning, which generates 2D (RGB) temporal sequence features using an RGB temporal estimator F_rte;
    • Subprocess c): Depth Temporal Transcending Learning, which trains a depth temporal estimator F_dte to yield depth temporal sequence features.
  • Step 3: Pseudo-4D Scene Transfer Initiation Learning. We use a limited amount of 4D data to further refine the transcending module and to directly feed the transcended 2D scene features into the full 4D-LLM so that it can interpret and produce the 4D-PSG.
  • Step 4: Large-scale Visual Scene Transfer Learning. Following scene transfer initiation learning, we leverage large volumes of 2D visual scene data (i.e., 2D SGs) to enhance 4D scene understanding for 4D-PSG generation. Specifically, the model takes only 2D scenes as input, which are then transcended into pseudo-4D scenes (cf. Fig. 2c).
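The sketch below illustrates, under loose assumptions, how the Step 2 transcending modules could be realized and how Step 4 could consume them: from a single 2D (RGB) feature, F_de predicts a depth feature, F_rte unrolls an RGB temporal sequence, and F_dte unrolls a depth temporal sequence, yielding a pseudo-4D scene feature that is fed to the 4D-LLM together with the abundant 2D SG supervision. The module choices (Linear/GRU), dimensions, sequence length, and the loss interface are illustrative assumptions, not the paper's exact implementation.

import torch
import torch.nn as nn

class Transcender(nn.Module):
    """Step 2 sketch: 2D (RGB) feature -> pseudo-4D (RGB + depth) temporal features."""
    def __init__(self, dim=1024, num_frames=8):
        super().__init__()
        self.num_frames = num_frames
        self.F_de = nn.Linear(dim, dim)                   # depth estimator (subprocess a)
        self.F_rte = nn.GRU(dim, dim, batch_first=True)   # RGB temporal estimator (subprocess b)
        self.F_dte = nn.GRU(dim, dim, batch_first=True)   # depth temporal estimator (subprocess c)

    def forward(self, rgb_feat):
        # rgb_feat: (B, dim) feature of a single 2D RGB frame
        depth_feat = self.F_de(rgb_feat)
        rgb_seq, _ = self.F_rte(rgb_feat.unsqueeze(1).repeat(1, self.num_frames, 1))     # (B, T, dim)
        depth_seq, _ = self.F_dte(depth_feat.unsqueeze(1).repeat(1, self.num_frames, 1))  # (B, T, dim)
        return rgb_seq, depth_seq                         # pseudo-4D scene features

def large_scale_transfer_step(transcender, llm_4d, rgb_feat, sg_labels, optimizer):
    """Step 4 sketch: only 2D scene features go in; they are transcended into a pseudo-4D
    scene and supervised with 2D SG annotations. llm_4d is a hypothetical 4D-LLM wrapper
    assumed to return a scalar loss."""
    rgb_seq, depth_seq = transcender(rgb_feat)
    loss = llm_4d(rgb_seq, depth_seq, targets=sg_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()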

transfer-learning

Figure 3: Illustration of the 2D-to-4D visual scene transfer learning process.

Quantitative Analysis


framework

Figure 4: A case illustrating the 4D-LLM's prediction for 4D-PSG generation.

BibTeX

@inproceedings{wu2025psg4dllm,
    title={Learning 4D Panoptic Scene Graph Generation from Rich 2D Visual Scene},
    author={Shengqiong Wu and Hao Fei and Jingkang Yang and Xiangtai Li and Juncheng Li and Hanwang Zhang and Tat-Seng Chua},
    booktitle={CVPR},
    year={2025}
}