From Pixels to Graphs: Open-Vocabulary Scene Graph Generation with Vision-Language Models

Illustration of the overall pipeline of our PGSG. We first generate scene graph sequences from images using a VLM. The relation construction module then grounds the entities and converts the categorical labels from each sequence. For VL tasks, SGG training provides parameters that initialize the VLM for fine-tuning.

Abstract

Scene graph generation (SGG) aims to parse a visual scene into an intermediate graph representation for downstream reasoning tasks. Despite recent advances, existing methods struggle to generate scene graphs with novel visual relation concepts. To address this challenge, we introduce a new open-vocabulary SGG framework based on sequence generation. Our framework leverages vision-language pre-trained models (VLMs) through an image-to-graph generation paradigm: we generate scene graph sequences via image-to-text generation with a VLM and then construct scene graphs from these sequences. In this way, we harness the strong capabilities of VLMs for open-vocabulary SGG and seamlessly integrate explicit relational modeling to enhance VL tasks. Experimental results demonstrate that our design not only achieves superior open-vocabulary performance but also improves downstream vision-language tasks through knowledge of explicit relation modeling.
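To make the image-to-graph paradigm concrete, the sketch below parses a generated scene graph sequence into (subject, predicate, object) triplets. The bracketed, semicolon-separated format is a hypothetical stand-in for the paper's actual sequence syntax, and `parse_scene_graph_sequence` is an illustrative helper, not the authors' implementation.

```python
import re


def parse_scene_graph_sequence(sequence: str):
    """Parse a VLM-generated scene graph sequence into relation triplets.

    Assumes a hypothetical format where each relation clause looks like
    "[subject] [predicate] [object]" and clauses are joined by ";".
    """
    triplets = []
    for clause in sequence.split(";"):
        # Extract the bracketed tokens of one relation clause.
        tokens = re.findall(r"\[([^\]]+)\]", clause)
        if len(tokens) == 3:  # subject, predicate, object
            triplets.append(tuple(tokens))
    return triplets


seq = "[person] [riding] [horse] ; [horse] [standing on] [grass]"
print(parse_scene_graph_sequence(seq))
# → [('person', 'riding', 'horse'), ('horse', 'standing on', 'grass')]
```

In the full pipeline, each parsed triplet would still need to be grounded to image regions and mapped to an open-vocabulary label space, which is the role of the relation construction module described above.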

Publication
In Conference on Computer Vision and Pattern Recognition (CVPR) 2024
Rongjie Li
PhD Student

My research interests include scene understanding, deep learning, graph neural networks.

Songyang Zhang
Shanghai AI Lab

My research interests include few/low-shot learning, graph neural networks and video understanding.

Xuming He
Associate Professor

My research interests include few/low-shot learning, graph neural networks and video understanding.