Vision-language Representation Learning

Grounded Image Text Matching with Mismatched Relation Reasoning

We introduce Grounded Image Text Matching with Mismatched Relation (GITM-MR), a novel visual-linguistic joint task that evaluates the relation understanding capabilities of transformer-based pre-trained models. We also propose the Relation-sensitive Correspondence Reasoning Network (RCRN) to improve the data efficiency and length generalization ability of pre-trained models.