Few-shot Learning for Visual Recognition

Despite recent success of deep neural networks, it remains challenging to efficiently learn new visual concepts from limited training data. To address this problem, a prevailing strategy is to build a meta-learner that learns prior knowledge on learning from a small set of annotated data. However, most of existing meta-learning approaches rely on a global representation of images and a meta-learner with complex model structures, which are sensitive to background clutter and difficult to interpret.

Visual Relationship Detection(VRD)

Reasoning about the relationships between objects in image is a crucial task for holistic scene understanding. Beyond existing works of recognition and detection, relationships between objects also constitute rich semantic information about the scene. The relationship can be represented in a triplet form of , e.g., as shown in Figure1. A crucial step towards understanding visual scene is building a structured representation that captures objects and their semantic relationships. A general way is using scene graph to describe the visual scene. Scene graph is a visually-grounded graph over the object instances in an image, where the edges depict their pairwise relationships. Figure1 shows the difference between object detection and scene graph generation, where object detection focus on recognizing existing objects in image while scene graph generation also need to capture their pairwise interactions.

Object and Scene Segmentation

We address the problem of joint detection and segmentation of multiple object instances in an image, a key step towards scene understanding. Inspired by data-driven methods, we propose an exemplar-based approach to the task of instance segmentation, in which a set of reference image/shape masks is used to find multiple objects. We design a novel CRF framework that jointly models object appearance, shape deformation, and object occlusion

  • Xuming He, Stephen Gould, An Exemplar-based CRF for Multi-instance Object Segmentation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014 [pdf] [Dataset]

  • Buyu Liu, Xuming He, Stephen Gould, Multi-class Semantic Video Segmentation with Exemplar-based Object Reasoning, IEEE Winter Conference on Applications of Computer Vision (WACV), 2015 [pdf]

Holistic Video Understanding

We address the problem of integrating object reasoning with supervoxel labeling in multiclass semantic video segmentation. To this end, we first propose an object-augmented CRF in spatio-temporal domain, which captures long-range dependency between supervoxels, and imposes consistency between object and supervoxel labels. We develop an efficient inference algorithm to jointly infer the supervoxel labels, object activations and their occlusion relations for a large number of object hypotheses.

  • Buyu Liu, Xuming He, Multiclass Semantic Video Segmentation With Object-Level Active Inference, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015 [pdf] [suppl zip]

  • Buyu Liu, Xuming He, Stephen Gould, Joint Semantic and Geometric Segmentation of Videos with a Stage Model, IEEE Winter Conference on Applications of Computer Vision (WACV), 2014 [pdf]

Depth Prediction from Images

We tackle the problem of single image depth estimation, which, without additional knowledge, suffers from many ambiguities. We introduce a hierarchical representation of the scene and formulate single image depth estimation as inference in a graphical model whose edges let us encode the interactions within and across the different layers of our hierarchy. Our method therefore still produces detailed depth estimates, but also leverages higher-level information about the scene.

  • Wei Zhuo, Mathieu Salzmann, Xuming He, Miaomiao Liu, Indoor Scene Structure Analysis for Single Image Depth Estimation, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015 [pdf]

  • Miaomiao Liu, Mathieu Salzmann, Xuming He, Discrete-Continuous Depth Estimation from a Single Image, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2014 [pdf]

Object Detection in Context

Exploring contextual relations is one of the key factors to improve object detection under challenging viewing condition and to scale up recognition to large numbers of object classes. We consider two effective approaches that incorporate contextual information: object codetection, which jointly detects object instances in a set of related images, and structural Hough voting, which models the context from 2.5D perspective for object localization under heavy occlusion.

  • Zeeshan Hayder, Mathieu Salzmann, Xuming He, Object Co-Detection via Efficient Inference in a Fully-Connected CRF, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015 [pdf]

  • Tao Wang, Xuming He, Nick Barnes, Learning Structured Hough Voting for Joint Object Detection and Occlusion Reasoning, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2013 [pdf] [Dataset]

From Image to Concept

We study basic-level categories for describing visual concepts, and empirically observe context-dependant basic-level names across thousands of concepts. We propose methods for predicting basic-level names using a series of classification and ranking tasks, producing the first large-scale catalogue of basic-level names for hundreds of thousands of images depicting thousands of visual concepts. We also demonstrate the usefulness of our method with a picture-to-word task.

  • Alexander Mathews, Lexing Xie, Xuming He, Choosing Basic-Level Concept Names using Visual and Language Context, IIEEE Winter Conference on Applications of Computer Vision (WACV),2015 [pdf][suppl]

  • Lexing Xie, Xuming He, Picture Tags and World Knowledge: Learning Tag Relations from Visual Semantic Sources, The 21st ACM International Conference on Multimedia (ACM MM), 2013 [pdf]