SGTR+: End-to-end Scene Graph Generation with Transformer

We introduce two new modules in SGTR+: a spatial-aware predicate node generator and a unified graph assembling module. These modules enhance both efficiency and quality in generating complex scene graphs.

Abstract

Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional property. Most previous works adopt a bottom-up, two-stage or point-based, one-stage approach, which often suffers from high time complexity or suboptimal designs. In this paper, we propose a novel SGG method to address the aforementioned issues, formulating the task as a bipartite graph construction problem. To address the issues above, we create a transformer-based end-to-end framework to generate the entity and entity-aware predicate proposal set, and infer directed edges to form relation triplets. Moreover, we design a graph assembling module to infer the connectivity of the bipartite scene graph based on our entity-aware structure, enabling us to generate the scene graph in an end-to-end manner. Based on bipartite graph assembling paradigm, we further propose a new technical design to address the efficacy of entity-aware modeling and optimization stability of graph assembling. Equipped with the enhanced entity-aware design, our method achieves optimal performance and time-complexity. Extensive experimental results show that our design is able to achieve the state-of-the-art or comparable performance on three challenging benchmarks, surpassing most of the existing approaches and enjoying higher efficiency in inference.

Publication
In IEEE Transactions on Pattern Analysis and Machine Intelligence
Rongjie Li
Rongjie Li
PhD Students

My research interests include scene understanding, deep learning, graph neural networks.

Songyang Zhang
Songyang Zhang
Shanghai AI Lab

My research interests include few/low-shot learning, graph neural networks and video understanding.

Xuming He
Xuming He
Associate Professor

My research interests include few/low-shot learning, graph neural networks and video understanding.

Related