Visual Reasoning

Grounded Image Text Matching with Mismatched Relation Reasoning

We introduce Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. We also propose the Relation-sensitive Correspondence Reasoning Network (RCRN) to improve the data efficiency and length generalization ability of pre-trained models.

HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

This paper proposes HOICLIP, an efficient human-object interaction (HOI) detection framework that transfers knowledge from the CLIP vision-language model to improve generalization.