Scene Graph Generation (SGG) remains a challenging visual understanding task due to its compositional property. Most previous works adopt a bottom-up, two-stage or point-based, one-stage approach, which often suffers from high time complexity or …
Visual grounding, which aims to build a correspondence between visual objects and their language entities, plays a key role in cross-modal scene understanding. One promising and scalable strategy for learning visual grounding is to utilize weak …
Scene graph generation is an important visual understanding task with a broad range of vision applications. Despite recent tremendous progress, it remains challenging due to the intrinsic long-tailed class distribution and large …