Universal Scene Graph Generation

National University of Singapore  

Abstract

Scene graph (SG) representations can neatly and efficiently describe scene semantics, which has driven sustained, intensive research in SG generation. In the real world, multiple modalities often coexist, and different types of data, such as images, text, video, and 3D, express distinct characteristics. Unfortunately, current SG research is largely confined to single-modality scene modeling, preventing full utilization of the complementary strengths of different modalities' SG representations in depicting holistic scene semantics. To this end, we introduce Universal SG (USG), a novel representation capable of fully characterizing comprehensive semantic scenes from any given combination of modality inputs, encompassing both modality-invariant and modality-specific scenes, as shown in Fig. 1. Further, we tailor a niche-targeting USG parser, USG-Par, which effectively addresses two key bottlenecks: cross-modal object alignment and out-of-domain generalization. USG-Par adopts a modular architecture for end-to-end USG generation, in which an object associator bridges the modality gap for cross-modal object alignment. We further propose a text-centric scene contrastive learning mechanism that mitigates domain imbalance by aligning multimodal objects and relations with textual SGs. Through extensive experiments, we demonstrate that USG offers a stronger capability for expressing scene semantics than standalone SGs, and that USG-Par achieves higher efficacy and performance.

Teaser

Figure 1: Illustrations of SGs (top) of single modalities in text, image, video, and 3D, and our proposed Universal SG (bottom). Note that the USG instance shown here is under the combination of four complete modalities, while practically any modality can be absent freely.

USG Definition

Here, we provide a detailed description of the nodes and edges in the USG. The USG is formally represented as $$ \mathcal{G}^{\mathcal{U}} = \{\mathcal{O}, \mathcal{R}\}, \;\; \text{where} \;\; \mathcal{O} = \{\mathcal{O}^{*}\}, \; * \in \{\mathcal{I}, \mathcal{V}, \mathcal{D}, \mathcal{S} \} $$ denotes the set of objects across all modalities. Each node carries a category label \( c_i^o \in \mathbb{C}^{\mathcal{O}} \) and a segmentation mask \( m_i \). For instance, as illustrated in Fig. 2, the object node set \(\mathcal{O}\) in the USG comprises the textual object node set \( \mathcal{O}^{\mathcal{S}}\) from the TSG and the visual object node set \(\mathcal{O}^{\mathcal{I}}\) from the ISG. The relation set is defined as $$ \mathcal{R} = \{\mathcal{R}^{*}, \mathcal{R}^{* \times \diamond}\}, \;\; *, \diamond \in \{\mathcal{I}, \mathcal{V}, \mathcal{D}, \mathcal{S} \} \;\; \text{and} \;\; * \ne \diamond, $$ where \( \mathcal{R}^{*} \) denotes intra-modality relationships and \( \mathcal{R}^{* \times \diamond} \) denotes inter-modality associations. An inter-modality association exists between objects from different modalities whenever they correspond to the same underlying object described in distinct modalities. For example, as shown in Fig. 2, the textual object Peter in the TSG corresponds to the visual object person in the ISG. Similarly, as depicted in Fig. 3, the sofa in the 3DSG aligns with the sofa in the ISG.

When an inter-modality association exists, the corresponding objects are merged into a unified node, as shown in Fig. 2 with the example of the headphones. This merged node represents the object across multiple modalities while retaining a single category label. Typically, the object name from the textual modality is prioritized for its flexibility and precision of description. Likewise, the relation predicate is preferentially adopted from the TSG, as it often provides a more descriptive and accurate representation. For instance, in Fig. 2, the relationship between Peter and sofa in the USG is relax on, derived from the TSG, rather than the less descriptive lying. Despite merging nodes, the segmentation masks from all modalities are preserved, ensuring that each modality's unique contribution to the object's representation is maintained within the USG.

In addition, to parse the USG for scenes derived from video together with other modalities, we first establish association relations between nodes from the other modalities and the objects in each frame of the VSG. For instance, as illustrated in Fig. 4, the objects Peter, sofa, and iPhone from the TSG are associated with the objects in every frame of the VSG. To ensure the USG comprehensively represents the scene described by the video and the other modalities, the scene from the other modalities is added as the first frame of the USG, while the remaining frames correspond to the frame-level scene graph representations from the VSG. This paradigm integrates multimodal information seamlessly, enriching the holistic representation of the scene within the USG framework.
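To make this definition concrete, the following is a minimal, illustrative Python sketch of how a USG could be stored and how two associated nodes are merged. All names here (USGNode, USGEdge, USG.merge) are our own hypothetical choices for exposition and do not correspond to a released implementation.

from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class USGNode:
    """An object node; after merging it may hold masks from several modalities."""
    category: str                                             # single label; the textual name is preferred
    masks: Dict[str, object] = field(default_factory=dict)    # modality tag ("I", "V", "D", "S") -> segmentation mask

@dataclass
class USGEdge:
    subj: int                                                 # index of the subject node
    obj: int                                                  # index of the object node
    predicate: str                                            # e.g. "relax on"; the TSG predicate is preferred

@dataclass
class USG:
    nodes: List[USGNode] = field(default_factory=list)
    relations: List[USGEdge] = field(default_factory=list)

    def merge(self, i: int, j: int, category: Optional[str] = None) -> int:
        """Merge node j into node i once an inter-modality association is established.

        The merged node keeps one category label (the textual name when supplied)
        but preserves the segmentation masks contributed by every modality.
        """
        self.nodes[i].masks.update(self.nodes[j].masks)
        if category is not None:
            self.nodes[i].category = category
        for r in self.relations:                              # redirect edges that referenced node j
            if r.subj == j:
                r.subj = i
            if r.obj == j:
                r.obj = i
        # Node j is left in place here for simplicity; a full implementation would remove it.
        return i

For example, the textual object Peter and the visual object person would first be created as separate nodes, then merged through their association so that the unified node keeps the name Peter together with the image segmentation mask.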

T-I-USG

Figure 2: Illustration of USG generated from text and image scenes.

T-3D-USG

Figure 3: Illustration of USG generated from text and 3D scenes.

T-V-USG

Figure 4: Illustration of USG generated from text and video scenes.

T-I-3D-USG

Figure 5: Illustration of USG generated from text, image and 3D scenes.

Method

Our model consists of five main modules, as shown in Fig. 6; a minimal sketch of the overall pipeline is given after the list below.

  • First, we extract the modality-specific features with a modality-specific backbone.
  • Second, we employ a shared mask decoder to extract object queries for various modalities. These object queries are then fed into the modality-specific object detection head to obtain the category label and tracked positions of the corresponding objects.
  • Third, the object queries are input into the object associator, which determines the association relationships between objects across modalities.
  • Fourth, a relation proposal constructor is utilized to retrieve the subject-object pairs with the highest confidence.
  • Finally, a relation decoder is employed to predict the final predicates between the subjects and objects.
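As a complement to Fig. 6, the following is a minimal PyTorch-style sketch that only makes the data flow of these five stages explicit. The module names and call signatures (encoders, mask_decoder, det_heads, associator, pair_constructor, relation_decoder) are assumptions for illustration and do not reproduce the actual USG-Par implementation.

import torch.nn as nn

class USGParPipelineSketch(nn.Module):
    """Illustrative five-stage data flow; not the released USG-Par code."""

    def __init__(self, encoders: nn.ModuleDict, mask_decoder: nn.Module,
                 det_heads: nn.ModuleDict, associator: nn.Module,
                 pair_constructor: nn.Module, relation_decoder: nn.Module):
        super().__init__()
        self.encoders = encoders                  # 1) modality-specific backbones
        self.mask_decoder = mask_decoder          # 2) shared mask decoder -> object queries
        self.det_heads = det_heads                #    modality-specific detection heads
        self.associator = associator              # 3) cross-modal object associator
        self.pair_constructor = pair_constructor  # 4) relation proposal constructor
        self.relation_decoder = relation_decoder  # 5) predicate decoder

    def forward(self, inputs: dict) -> dict:
        # 1) extract modality-specific features
        feats = {m: self.encoders[m](x) for m, x in inputs.items()}
        # 2) shared mask decoder yields object queries; detection heads predict labels and positions
        queries = {m: self.mask_decoder(f) for m, f in feats.items()}
        detections = {m: self.det_heads[m](q) for m, q in queries.items()}
        # 3) score association relationships between object queries across modalities
        associations = self.associator(queries)
        # 4) keep only the subject-object pairs with the highest confidence
        pairs = self.pair_constructor(queries, detections)
        # 5) decode the predicate for each proposed pair
        relations = self.relation_decoder(pairs, feats)
        return {"detections": detections, "associations": associations, "relations": relations}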

framework

Figure 6: Overview of USG-Par architecture. It mainly consists of five modules, including modality-specific encoders, shared mask decoder, object associator, relation proposal constructor, and relation decoder.

object associator

Figure 7: Illustration of the object associator for establishing associations between different modalities.

learning

Figure 8: Illustration of the object-level and relation-level text-centric scene contrasting learning mechanism.

Training

This section elaborates on the training objectives and strategies to optimize our system.

  • Object Detection Loss. During training, we first apply Hungarian matching between the predicted and ground-truth entity masks to assign object queries to entities in text, video, image, and 3D modalities. This assignment is then used to supervise the mask predictions and category label classifications.
  • Object Association Loss. To optimize the object associator, we supervise it with the ground-truth binary association matrix.
  • Relation Classification Loss. For relation predicate classification, we employ a sigmoid cross-entropy (CE) loss, similar to the one used for object category classification.
  • Text-centric Scene Contrastive Learning. A text-centric scene contrastive learning approach aligns the other modalities with text data, which offers two key advantages: 1) TSG data cover the most diverse and general domain, and 2) binding information from other modalities to text effectively addresses the scarcity of USG data for certain modality combinations. A generic sketch of such a contrastive loss is given after this list.
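To illustrate how such text-centric alignment can be implemented, below is a generic symmetric InfoNCE-style contrastive loss that pulls object (or relation) embeddings from another modality toward their paired textual embeddings. This is only a sketch of the general technique under the assumption of paired, batch-aligned embeddings; the exact loss used in USG-Par may differ.

import torch
import torch.nn.functional as F

def text_centric_contrastive_loss(text_emb: torch.Tensor,
                                  other_emb: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE aligning another modality's embeddings to text embeddings.

    text_emb, other_emb: (N, d) tensors in which row i of both tensors describes the
    same object (or relation). Diagonal pairs are positives; the other rows in the
    batch act as negatives.
    """
    text_emb = F.normalize(text_emb, dim=-1)
    other_emb = F.normalize(other_emb, dim=-1)
    logits = text_emb @ other_emb.t() / temperature          # (N, N) cosine-similarity logits
    targets = torch.arange(text_emb.size(0), device=text_emb.device)
    # Cross-entropy in both directions: text -> other modality and other modality -> text.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))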


Quantitative Analysis


PSG

Table 1: Evaluation on the PSG under the SGDet task.

3DDSG

Table 2: Evaluation results on the 3DDSG dataset.

PVSG

Table 3: Evaluation on the PVSG dataset.

TSG

Table 4: Performance on the FACTUAL dataset.

BibTeX

@inproceedings{wu2025usg,
    title={Universal Scene Graph Generation},
    author={Wu, Shengqiong and Fei, Hao and Chua, Tat-Seng},
    booktitle={CVPR},
    year={2025}
}