RESEARCH PROJECTS

Project Highlights

Feedback Networks
Currently, the most successful learning models in computer vision are based on learning successive representations followed by a decision layer. This is usually actualized through feedforward multilayer neural networks, e.g. ConvNets, where each layer forms one such successive representation. However, an alternative that can achieve the same goal is a feedback-based approach, in which the representation is formed iteratively based on feedback received from the previous iteration's output.
We establish that a feedback-based approach has several fundamental advantages over feedforward: it enables making early predictions at query time, its output naturally conforms to a hierarchical structure in the label space (e.g. a taxonomy), and it provides a new basis for Curriculum Learning. We observe that feedback networks develop a considerably different representation from their feedforward counterparts, in line with the aforementioned advantages. We put forth a general feedback-based learning architecture whose endpoint results are on par with or better than those of existing feedforward networks, with the addition of the above advantages. We also investigate several mechanisms in feedback architectures (e.g. skip connections in time) and design choices (e.g. feedback length). We hope this study offers new perspectives in the quest for more natural and practical learning models.
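As a rough, hypothetical illustration of this iterative scheme (not the convolutional-LSTM architecture from the paper), the sketch below applies a shared convolutional block repeatedly, feeding the previous state back in and emitting a prediction at every iteration; all layer sizes are placeholder assumptions.

    import torch
    import torch.nn as nn

    class FeedbackClassifier(nn.Module):
        def __init__(self, num_classes, channels=64, iterations=4):
            super().__init__()
            self.iterations = iterations
            # Image -> initial feature map.
            self.encode = nn.Sequential(nn.Conv2d(3, channels, 3, padding=1), nn.ReLU())
            # Shared block applied at every iteration; input is [features, previous state].
            self.recur = nn.Sequential(nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU())
            self.head = nn.Linear(channels, num_classes)  # prediction head, reused every iteration

        def forward(self, x):
            feat = self.encode(x)
            state = torch.zeros_like(feat)            # feedback state is empty at iteration 0
            predictions = []
            for _ in range(self.iterations):
                state = self.recur(torch.cat([feat, state], dim=1))
                predictions.append(self.head(state.mean(dim=(2, 3))))  # an early prediction each pass
            return predictions                        # one logit vector per iteration

Supervising each per-iteration output (e.g. coarse labels early, fine labels later) is one way such a model can realize the early-prediction and curriculum properties described above.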
Check out our project page and paper.

Universal Correspondence Network
We have proposed a novel deep metric learning approach to visual correspondence estimation that is shown to be advantageous over approaches that optimize a surrogate patch similarity objective. We propose several innovations, such as a correspondence contrastive loss in a fully convolutional architecture, on-the-fly active hard negative mining, and a convolutional spatial transformer. These lend capabilities such as more efficient training, accurate gradient computations, faster testing and local patch normalization, which lead to improved speed or accuracy. We demonstrate in experiments that our features perform better than prior state-of-the-art on both geometric and semantic correspondence tasks, even without using any spatial priors or global optimization. In future work, we will explore applications of our correspondences for rigid and non-rigid motion or shape estimation as well as applying global optimization.
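For concreteness, below is a minimal sketch of a correspondence contrastive loss of the kind described above, applied to feature vectors sampled at paired pixel locations; the margin value is an arbitrary assumption, and the hard negative mining and convolutional spatial transformer are omitted.

    import torch

    def correspondence_contrastive_loss(feat_a, feat_b, labels, margin=1.0):
        """feat_a, feat_b: (N, C) features sampled at paired pixel locations.
        labels: (N,) 1 for a true correspondence, 0 for a negative pair."""
        dist = torch.norm(feat_a - feat_b, dim=1)                           # distance per pair
        pos_term = labels * dist.pow(2)                                     # pull matches together
        neg_term = (1 - labels) * torch.clamp(margin - dist, min=0).pow(2)  # push non-matches apart
        return 0.5 * (pos_term + neg_term).mean()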
Check out our project page and NIPS 2016 oral paper and supplementary materials.

Learning to Track at 100 FPS with Deep Regression Networks
Machine learning techniques are often used in computer vision due to their ability to leverage large amounts of training data to improve performance. Unfortunately, most generic object trackers are still trained from scratch online and do not benefit from the large number of videos that are readily available for offline training. We propose a method for using neural networks to track generic objects in a way that allows them to improve performance by training on labeled videos. Previous attempts to use neural networks for tracking have been very slow to run and not practical for real-time applications. In contrast, our tracker uses a simple feed-forward network with no online training required, allowing it to run at 100 fps at test time. Our tracker is trained on both labeled videos and a large collection of images, which helps prevent overfitting. The tracker learns generic object motion and can be used to track novel objects that do not appear in the training set. We test our network on a standard tracking benchmark to demonstrate our tracker's state-of-the-art performance. Our network learns to track generic objects in real-time as they move throughout the world.
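The sketch below illustrates the feed-forward regression idea in its simplest form: a crop of the previous frame and a search region from the current frame go in, four box coordinates come out, with no online updating. The layer sizes are placeholders, not the CaffeNet-style towers used in the actual tracker.

    import torch
    import torch.nn as nn

    class RegressionTracker(nn.Module):
        def __init__(self):
            super().__init__()
            self.tower = nn.Sequential(            # shared conv tower for both crops
                nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(6))
            self.regress = nn.Sequential(          # fully connected regression head
                nn.Flatten(),
                nn.Linear(2 * 64 * 6 * 6, 512), nn.ReLU(),
                nn.Linear(512, 4))                 # (x1, y1, x2, y2) within the search region

        def forward(self, prev_crop, curr_search):
            f_prev = self.tower(prev_crop)         # what the target looked like
            f_curr = self.tower(curr_search)       # where to look in the new frame
            return self.regress(torch.cat([f_prev, f_curr], dim=1))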
Check out our project page and ECCV 2016 paper.

3D-R2N2: 3D-Recurrent Reconstruction Neural Network
Inspired by the recent success of methods that employ shape priors to achieve robust 3D reconstructions, we propose a novel recurrent neural network architecture that we call the 3D Recurrent Reconstruction Neural Network (3D-R2N2). The network learns a mapping from images of objects to their underlying 3D shapes from a large collection of synthetic data. Our network takes in one or more images of an object instance from arbitrary viewpoints and outputs a reconstruction of the object in the form of a 3D occupancy grid. Unlike most of the previous works, our network does not require any image annotations or object class labels for training or testing. Our extensive experimental analysis shows that our reconstruction framework i) outperforms the state-of-the-art methods for single view reconstruction, and ii) enables the 3D reconstruction of objects in situations when traditional SFM/SLAM methods fail (because of lack of texture and/or wide baseline).
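A much-simplified sketch of such an image-to-occupancy-grid pipeline is shown below: a 2D encoder per view, a recurrent update of a low-resolution 3D memory, and a 3D up-convolutional decoder. The channel counts, grid size, and simple tanh gating are illustrative assumptions; the actual model uses a 3D convolutional LSTM/GRU.

    import torch
    import torch.nn as nn

    class MultiViewReconstructor(nn.Module):
        def __init__(self, feat_dim=128, hidden=32, grid=4):
            super().__init__()
            self.grid = grid
            self.encoder = nn.Sequential(                       # 2D CNN: one view -> feature vector
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                nn.Linear(64, feat_dim))
            self.to_grid = nn.Linear(feat_dim, hidden * grid ** 3)   # inject the view into the 3D memory
            self.gate = nn.Conv3d(2 * hidden, hidden, 3, padding=1)  # simple gated recurrent update
            self.decoder = nn.Sequential(                       # 3D up-convolutions -> occupancy logits
                nn.ConvTranspose3d(hidden, 16, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose3d(16, 1, 4, stride=2, padding=1))

        def forward(self, views):                               # views: (B, V, 3, H, W), arbitrary V
            b, v = views.shape[:2]
            h = views.new_zeros(b, self.gate.out_channels, self.grid, self.grid, self.grid)
            for i in range(v):                                  # fold in one view at a time
                feat = self.to_grid(self.encoder(views[:, i])).view_as(h)
                h = torch.tanh(self.gate(torch.cat([h, feat], dim=1)))
            return self.decoder(h)                              # (B, 1, 16, 16, 16) occupancy logits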
Check out our project page, ECCV 2016 paper and code.

ObjectNet3D Database
We contribute a large scale database for 3D object recognition, named ObjectNet3D, that consists of 100 categories, 90,127 images, 201,888 objects in these images and 44,147 3D shapes. Objects in the images in our database are aligned with the 3D shapes, and the alignment provides both accurate 3D pose annotation and the closest 3D shape annotation for each 2D object. Consequently, our database is useful for recognizing the 3D pose and 3D shape of objects from 2D images. We also provide baseline experiments on four tasks: region proposal generation, 2D object detection, joint 2D detection and 3D object pose estimation, and image-based 3D shape retrieval, which can serve as baselines for future research using our database.
Check out our project page and ECCV 2016 paper.

Generic 3D Representation
What does it take to develop an agent with human-like intelligent visual perception? The popular paradigms currently employed in computer vision are problem-specific supervised learning, and to a lesser extent, unsupervised and reinforcement learning. However, we argue that none of these would lead to truly intelligent visual perception unless the learning framework is specifically devised to gain abstraction and generalization power. Here we show our approach to this problem, which is inspired by the developmental stages of vision skills in humans. Specifically, rather than training a new model for every individual desired problem, we train a model to learn fundamental vision tasks that serve as the foundation for ultimately solving the desired problems. As our first effort towards validating this approach, we employ this method to learn a generic 3D representation by supervising two basic but fundamental 3D tasks. We show that the learned representation generalizes to unseen 3D tasks without the need for any fine-tuning, while achieving human-level performance on the task it was supervised for.
Check out our project page and ECCV 2016 paper (Generic 3D Representations via Pose Estimation and Matching).

Structural-RNN
Deep Recurrent Neural Network architectures, though remarkably capable at modeling sequences, lack an intuitive high-level spatio-temporal structure. This is despite the fact that many problems in computer vision inherently have an underlying high-level structure and can benefit from it. Spatio-temporal graphs are a popular and flexible tool for imposing such high-level intuitions in the formulation of real-world problems. In this paper, we propose an approach for combining the power of high-level spatio-temporal graphs with the sequence-learning success of Recurrent Neural Networks (RNNs). We develop a scalable method for casting an arbitrary spatio-temporal graph as a rich RNN mixture that is feedforward, fully differentiable, and jointly trainable. The proposed method is generic and principled, as it can be used for transforming any spatio-temporal graph through a well-defined set of steps. The evaluation of the proposed approach on a diverse set of problems, ranging from modeling human motion to object interactions, shows improvement over the state of the art by a large margin. We expect this method to empower a new, convenient approach to problem formulation through high-level spatio-temporal graphs and Recurrent Neural Networks, and to be of broad interest to the community.
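To make the factorization concrete, the toy sketch below wires up a graph with one node and one spatio-temporal edge: an edge RNN summarizes the interaction over time, and its output is concatenated into the input of a node RNN, so the whole mixture stays feedforward and jointly trainable. The sizes, and the collapsing of node and edge types into a single RNN each, are illustrative assumptions.

    import torch
    import torch.nn as nn

    class TinySRNN(nn.Module):
        """Toy graph: one node (e.g. 'arm') connected to one spatio-temporal edge (e.g. 'arm-object')."""
        def __init__(self, node_dim=16, edge_dim=8, hidden=32, out_dim=4):
            super().__init__()
            self.edge_rnn = nn.LSTM(edge_dim, hidden, batch_first=True)           # models the interaction
            self.node_rnn = nn.LSTM(node_dim + hidden, hidden, batch_first=True)  # consumes edge context
            self.readout = nn.Linear(hidden, out_dim)

        def forward(self, node_feats, edge_feats):
            # node_feats: (B, T, node_dim) for the node, edge_feats: (B, T, edge_dim) for the edge.
            edge_out, _ = self.edge_rnn(edge_feats)               # summarize the edge over time
            node_in = torch.cat([node_feats, edge_out], dim=-1)   # edge context feeds the node RNN
            node_out, _ = self.node_rnn(node_in)
            return self.readout(node_out)                         # per-time-step node prediction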
Check out our project page and CVPR 2016 paper (Best Student Paper Award).

Jackrabbot
In this project, we are developing a demonstration platform for making deliveries locally within the Stanford campus. The Stanford “Jackrabbot”, which takes its name from the nimble yet shy jackrabbit, is a self-navigating automated electric delivery cart capable of carrying small payloads. In contrast to autonomous cars, which operate on streets and highways, the Jackrabbot is designed to operate in pedestrian spaces, at a maximum speed of five miles per hour. When people are encountered, the Jackrabbot steps aside and crouches down to minimize its impact on human activities. Its progress and status can be monitored remotely via GPS, video, and audio.
project page

3D Semantic Parsing of Large-Scale Indoor Spaces
In this project, we propose a method for parsing the point cloud of an entire building using a hierarchical approach: first, the raw point cloud is parsed into semantically meaningful spaces (e.g. rooms, hallways, etc.) that are aligned into a canonical reference coordinate system. Second, each of these spaces is parsed into its structural and building elements (e.g. walls, columns, etc.). Performing these steps with a strong notion of global 3D space is the backbone of our method. The alignment in the first step enables injecting strong 3D priors from the canonical coordinate system into the second step for discovering elements. This allows handling challenging scenarios, as man-made indoor spaces often exhibit recurrent geometric patterns even when appearance features change drastically.
Check out our CVPR 2016 paper and the project page.

Deep Metric Learning via Lifted Structured Feature Embedding
Learning the distance metric between pairs of examples is of great importance for learning and visual recognition. With the remarkable success of state-of-the-art convolutional neural networks, recent works have shown promising results on discriminatively training the networks to learn semantic feature embeddings where similar examples are mapped close to each other and dissimilar examples are mapped farther apart. In this project, we describe an algorithm for taking full advantage of the training batches in neural network training by lifting the vector of pairwise distances within the batch to the matrix of pairwise distances. This step enables the algorithm to learn a state-of-the-art feature embedding by optimizing a novel structured prediction objective on the lifted problem. Additionally, we collected the Stanford Online Products dataset: 120k images of 23k classes of online products for metric learning.
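A compact, unoptimized sketch of the lifted structured loss over a mini-batch is given below; the margin value is an assumption, and the log-sum-exp over all negatives of both endpoints of each positive pair follows the formulation described above.

    import torch

    def lifted_structured_loss(embeddings, labels, margin=1.0):
        d = torch.cdist(embeddings, embeddings)               # full matrix of pairwise distances
        same = labels.unsqueeze(0) == labels.unsqueeze(1)     # (N, N) same-class mask
        eye = torch.eye(len(labels), dtype=torch.bool, device=embeddings.device)
        pos_mask = same & ~eye
        # For each anchor, sum exp(margin - distance) over all of its negatives.
        neg_exp = torch.where(same, torch.zeros_like(d), torch.exp(margin - d)).sum(dim=1)
        losses = []
        idx_i, idx_j = pos_mask.nonzero(as_tuple=True)
        for i, j in zip(idx_i.tolist(), idx_j.tolist()):
            if i < j:                                         # count each positive pair once
                j_ij = torch.log(neg_exp[i] + neg_exp[j]) + d[i, j]
                losses.append(torch.clamp(j_ij, min=0) ** 2)
        return torch.stack(losses).mean() / 2 if losses else embeddings.new_zeros(())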
Check out our CVPR 2016 paper and the project page.

DeLay: Robust Spatial Layout Estimation for Cluttered Indoor Scenes
Estimating the layout of a room (such as the ceiling, floors, and walls) given a monocular RGB image is essential for various tasks and applications such as autonomous navigation, scene understanding, and even augmented reality. In this work, we propose an approach for the layout estimation problem by using a fully convolutional neural network in conjunction with a novel optimization framework that refines the neural network output and produces valid layouts. Our method is robust to clutter and works on a wide range of challenging scenes, achieving state-of-the-art results on two leading room layout datasets and outperforming prior methods by a large margin.
Check out our CVPR 2016 paper and project page.

Scene Understanding

3D Semantic Parsing of Large-Scale Indoor Spaces
In this project, we propose a method for parsing the point cloud of an entire building using a hierarchical approach: first, the raw point cloud is parsed into semantically meaningful spaces (e.g. rooms, hallways, etc.) that are aligned into a canonical reference coordinate system. Second, each of these spaces is parsed into its structural and building elements (e.g. walls, columns, etc.). Performing these steps with a strong notion of global 3D space is the backbone of our method. The alignment in the first step enables injecting strong 3D priors from the canonical coordinate system into the second step for discovering elements. This allows handling challenging scenarios, as man-made indoor spaces often exhibit recurrent geometric patterns even when appearance features change drastically.
Check out our CVPR 2016 paper and the project page.

DeLay: Robust Spatial Layout Estimation for Cluttered Indoor Scenes
Estimating the layout of a room (such as the ceiling, floors, and walls) given a monocular RGB image is essential for various tasks and applications such as autonomous navigation, scene understanding, and even augmented reality. In this work, we propose an approach for the layout estimation problem by using a fully convolutional neural network in conjunction with a novel optimization framework that refines the neural network output and produces valid layouts. Our method is robust to clutter and works on a wide range of challenging scenes, achieving state-of-the-art results on two leading room layout datasets and outperforming prior methods by a large margin.
Check out our CVPR 2016 paper.

Understanding Indoor Scenes using 3D Geometric Phrases
In this project we present a hierarchical scene model for learning and reasoning about complex indoor scenes that can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.
Check out our CVPR 2013 paper (selected as an oral presentation) and the project page. This work is supported by ONR grant N00014111038 and a gift award from HTC.

Relating Things and Stuff via Object Property Interactions
In this project we propose a new framework for scene understanding that jointly models things (i.e., object categories with a well-defined shape such as people and cars) and stuff (i.e., object categories with an amorphous spatial extent such as grass and sky). Our framework allows enforcing sophisticated geometric and semantic relationships between thing and stuff categories in a single graphical model. We demonstrate that our method achieves competitive performance in segmenting and detecting objects on several public datasets.
Check out our HitPot ECCV 2012 paper and the project page. This research is in collaboration with Byung Soo Kim, Min Sun and Pushmeet Kohli (Microsoft Research) and is sponsored by the Gigascale Systems Research Center and NSF CPS grant #0931474.

Semantic Structure from Motion (SSFM)
We propose a new framework for jointly recognizing objects as well as reconstructing the underlying 3D geometry of the scene (cameras, points and objects). In our SSFM framework we leverage the intuition that measurements of keypoints and objects must be semantically and geometrically consistent across viewpoints. Our SSFM framework has the unique ability to: i) estimate camera poses from object detections only; ii) enhance camera pose estimation, compared to feature-point-based SFM algorithms; iii) improve object detections given multiple uncalibrated images, compared to independently detecting objects in single images.
Check out our CVPR 2012 paper and poster, our CVPR 2011 paper and poster, our book chapter, and our CORP-ICCV 2011 paper. Our CORP-ICCV 2011 paper is the winner of the best student paper award. A newly proposed Microsoft KINECT dataset for evaluation can be found here. In collaboration with Sid Ying-ze Bao. This research is sponsored by the Giga Scale Research Center and NSF CAREER #1054127.

Monitoring with D4AR (4 Dimensional Augmented Reality) Models
In this research, construction progress deviations between as-planned and as-built construction are measured through superimposition of the as-planned model onto site photographs for different time stamps. Our approach is based on sparse 3D reconstruction and recognition of as-built scene elements using state-of-the-art machine learning methodologies.
Click here for our recent WCVRS-ICCV 2012 paper. For an earlier version please refer to our ITCON journal paper. This line of work won the best CRC poster award at the Construction Research Congress (Seattle 2009), the Best Student Paper Award at the 6th Int. Conference on Innovation in AEC in 2010, and the 2012 Best Paper Award from the Journal of Construction Engineering and Management. In collaboration with Mani Golparvar-Fard and Feniosky Pena-Mora; Sponsored by NSF grant #0800500 and KLA-Tencor.

Coherent Object Recognition and Scene Layout Understanding
We propose a new coherent framework for joint object detection and scene 3D layout estimation from a single image. We explore the geometrical relationships among objects, the physical space containing the objects, and the observer. In our framework, object detection becomes increasingly accurate as additional evidence about a specific scene becomes available. In turn, improved detection results enable more stable and accurate estimates of the scene layout and object supporting surfaces.
Click here for the journal paper published in Image and Vision Computing, here for BMVC 10 oral, and here for the CVPR 10 paper (more details). Our Desk-top Dataset is now available. In collaboration with Sid Yingze Bao and Min Sun. This research is sponsored by the Giga Scale Research Center and NSF grant #0931474.

Scene categorization and understanding from videos
We present a new method for categorizing video sequences capturing different scene classes. This can be seen as a generalization of previous work on scene classification from single images. A scene is represented by a collection of 3D points with an appearance based codeword attached to each point. A hierarchical structure of histograms located at different locations and at different scales is used to capture the typical spatial distribution of 3D points and codewords in the working volume.
Click here for our ICCV 09 paper and here for the database. In collaboration with Paritosh Gupta, Sankalp Arrabolu and Matthew Brown.

Object Detection/Recognition

Enriching Object Detection with 2D-3D Registration and Continuous Viewpoint Estimation
We propose an efficient method for synthesizing templates from 3D models that runs on the fly -- that is, it quickly produces detectors for an arbitrary viewpoint of a 3D model without expensive dataset-dependent training or template storage. Given a 3D model and an arbitrary continuous detection viewpoint, our method synthesizes a discriminative template by extracting features from a rendered view of the object and decorrelating spatial dependencies among the features. Our decorrelation procedure relies on a gradient-based algorithm that is more numerically stable than standard decomposition-based procedures, and we efficiently search for candidate detections by computing FFT-based template convolutions. Due to the speed of our template synthesis procedure, we are able to perform joint optimization of scale, translation, continuous rotation, and focal length using the Metropolis-Hastings algorithm. We provide an efficient GPU implementation of our algorithm, and we validate its performance on the 3D Object Classes and PASCAL3D+ datasets.
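As a small illustration of the FFT-based search step mentioned above, the sketch below scores an (already whitened) multi-channel template against a feature map via the convolution theorem rather than explicit sliding-window dot products; the function name and shapes are assumptions, and feature extraction and the decorrelation step itself are omitted.

    import numpy as np

    def fft_score_map(feature_map, template):
        """feature_map: (H, W, C), template: (h, w, C). Returns (H-h+1, W-w+1) scores."""
        H, W, C = feature_map.shape
        h, w, _ = template.shape
        scores = np.zeros((H - h + 1, W - w + 1))
        for c in range(C):                                   # correlate channel by channel, then sum
            f = np.fft.rfft2(feature_map[:, :, c])
            # Flipping the kernel turns FFT convolution into cross-correlation.
            t = np.fft.rfft2(template[::-1, ::-1, c], s=(H, W))
            full = np.fft.irfft2(f * t, s=(H, W))
            scores += full[h - 1:H, w - 1:W]                 # keep only fully-overlapping positions
        return scores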
Check out our CVPR 2015 paper and the code.

A Coarse-to-Fine Model for 3D Pose Estimation and Sub-category Recognition
Despite the fact that object detection, 3D pose estimation, and sub-category recognition are highly correlated tasks, they are usually addressed independently from each other because of the huge space of parameters. To jointly model all of these tasks, we propose a coarse-to-fine hierarchical representation, where each level of the hierarchy represents objects at a different level of granularity. The hierarchical representation prevents performance loss, which is often caused by the increase in the number of parameters (as we consider more tasks to model), and the joint modeling enables resolving ambiguities that exist in independent modeling of these tasks.
Check out our CVPR 2015 paper and the project page. The project is sponsored by ONR grant N00014-13-1-0761, NSF CAREER grant #1054127, the Ford-Stanford Innovation Alliance Award, and DARPA.

Data-Driven 3D Voxel Patterns for Object Category Recognition
In this work, we propose a novel object representation, the 3D Voxel Pattern (3DVP), that jointly encodes the key properties of objects including appearance, 3D shape, viewpoint, occlusion and truncation. We discover 3DVPs in a data-driven way, and train a bank of specialized detectors for a dictionary of 3DVPs. The 3DVP detectors are capable of detecting objects with specific visibility patterns and transferring the meta-data from the 3DVPs to the detected objects, such as the 2D segmentation mask, 3D pose, and occlusion or truncation boundaries. The transferred meta-data allows us to infer the occlusion relationship among objects, which in turn provides improved object recognition results.
Check out our CVPR 2015 paper and the project page. The project is sponsored by NSF CAREER grant N.1054127, ONR award N000141110389, and DARPA UPSIDE grant A13-0895-S002.

Beyond PASCAL: A Benchmark for 3D Object Detection in the Wild
In this project, we contribute the PASCAL3D+ dataset, a novel and challenging dataset for 3D object detection and pose estimation. PASCAL3D+ augments 12 rigid categories of the PASCAL VOC 2012 with 3D annotations. Furthermore, more images are added for each category from ImageNet. PASCAL3D+ images exhibit much more variability compared to the existing 3D datasets, and on average there are more than 3,000 object instances per category. We believe this dataset will provide a rich testbed to study 3D detection and pose estimation and will help to significantly push forward research in this area.
Check out our WACV 2014 paper and the project page. The project is sponsored by ONR grant N00014-13-1-0761 and NSF CAREER grant #1054127.

Object Detection by 3D Aspectlets and Occlusion Reasoning
In this project, we propose a novel framework for detecting multiple objects from a single image and reasoning about occlusions between objects. We address this problem from a 3D perspective in order to handle various occlusion patterns which can take place between objects. We introduce the concept of “3D aspectlets” based on a piecewise planar object representation. A 3D aspectlet represents a portion of the object which provides evidence for partial observation of the object. A new probabilistic model (which we call the spatial layout model) is proposed to combine the bottom-up evidence from 3D aspectlets and the top-down occlusion reasoning to help object detection.
Check out our 3DRR 2013 paper and the project page. The project is sponsored by NSF CAREER grant #1054127, NSF CPS grant #0931474 and a HTC award.

Accurate Localization of 3D Objects from RGB-D Data using Segmentation Hypotheses
In this project we focus on the problem of detecting objects in 3D from RGB-D images. We propose a novel framework that explores the compatibility between segmentation hypotheses of the object in the image and the corresponding 3D map. Our framework allows us to discover the optimal location of the object using a generalization of the structural latent SVM formulation in 3D. Extensive quantitative and qualitative experiments show that our proposed approach outperforms state-of-the-art methods for both 3D and 2D object recognition tasks.
Check out our CVPR 2013 paper and the project page. This research is in collaboration with Byung-soo Kim and Shili Xu, and is sponsored by the ARO grant W911NF-09-1-0310, NSF CPS grant #0931474 and a KLA-Tencor Fellowship.

Weakly Supervised Learning of Mid-Level Features with Beta-Bernoulli Process Restricted Boltzmann Machines
In this project we develop a new approach to learn mid-level features which capture recognizable semantic concepts. This is achieved by using a weakly supervised approach based on restricted Boltzmann machine (RBM) to learn mid-level features, where only class-level supervision is provided during training. Our experimental results on object recognition tasks show significant performance gains, outperforming existing methods which rely on manually labeled semantic attributes.
Check out our CVPR 2013 paper and the project page. This research is in collaboration with Prof Honglak Lee and Benjamin Kuipers and is sponsored by NSF CPS grant #0931474.

Understanding Indoor Scenes using 3D Geometric Phrases
In this project we present a hierarchical scene model for learning and reasoning about complex indoor scenes that can be learned from a reasonable amount of training data, and avoids oversimplification. At the core of this approach is the 3D Geometric Phrase model which captures the semantic and geometric relationships between objects which frequently co-occur in the same 3D spatial configuration. Experiments show that this model effectively explains scene semantics, geometry and object groupings from a single image, while also improving individual object detections.
Check out our CVPR 2013 paper (selected as an oral presentation) and the project page. This work is supported by ONR grant N00014111038 and a gift award from HTC.

Object Co-detection
Given a set of images containing the same objects, the goal of co-detection is to detect the objects simultaneously in multiple images, as well as match individual instances across images. Our method can effectively measure object self-occlusions and viewpoint transformations. The co-detector is able to obtain more accurate detection results than if objects were to be detected from each image individually.
Check out our ECCV 2012 paper and the project page. This research is in collaboration with Yingze Bao and Yu Xiang. It is sponsored by NSF CAREER #1054127 and NSF CPS grant #0931474.

Estimating the Aspect Layout of Object Categories
In this project we seek to move away from the traditional paradigm for 2D object recognition whereby objects are identified in the image as 2D bounding boxes. We focus instead on: i) detecting objects; ii) identifying their 3D poses; iii) characterizing the geometrical and topological properties of the objects in terms of their aspect configurations in 3D.
Check out our CVPR 2012 paper and poster. The source code can be downloaded from here. This research is in collaboration with Yu Xiang and is sponsored by ARO W911NF-09-1-0310, NSF CPS grant #0931474 and a KLA-Tencor Fellowship.

Mobile Object Detection through Client-Server based Vote Transfer
In this work we present a novel multi-frame object detection application for the mobile platform that is capable of object localization. We implement the multi-frame detector on a mobile device running the Android OS through a novel client-server framework that presents a sound and viable environment for the multi-frame detector.
Check out our CVPR 2012 paper and poster. This research is in collaboration with Shyam Kumar and Min Sun and is sponsored by a Google Research Award and the Gigascale Systems Research Center.

Joint Detection and Pose Estimation of Articulated Objects
In this project, we propose a new model called the Articulated Part-based Model (APM) for jointly detecting objects (e.g. humans) and estimating their poses (e.g. the configuration of body parts such as arms, torso, and head). APM recursively represents an object as a collection of parts at multiple levels of detail, from coarse to fine, where parts at every level are connected to a coarser level through a parent-child relationship. Extensive quantitative and qualitative experimental results on public datasets show that APM outperforms state-of-the-art methods.
In collaboration with Min Sun. Click here for our project webpage and the dataset. Check out our ICCV 2011 paper and poster. This research is sponsored by ARO.

Hierarchical Classification of Images by Sparse Approximation
In this project we show that the hierarchical structure of a database can be used successfully to enhance classification accuracy using a sparse approximation framework. We propose a new formulation for sparse approximation where the goal is to discover the sparsest path within the hierarchical data structure that best represents the query object. Extensive quantitative and qualitative experimental evaluation on a number of subsets of the ImageNet database demonstrates our theoretical claims and shows that our approach produces better hierarchical categorization results than competing techniques.
Check out our BMVC 2011 paper and poster. In collaboration with Byung-Soo Kim, Jae Young Park, and Prof. Anna Gilbert; Sponsored by a Google Research Award and the Gigascale Systems Research Center.

Depth-Encoded Hough Voting for Joint Object Detection and Shape Recovery
In this project we aim at simultaneously detecting objects, estimating their pose and recovering their 3D shape. We propose a new method called DEHV - Depth-Encoded Hough Voting. DEHV is a probabilistic Hough voting scheme which incorporates depth information into the process of learning distributions of image features (patches) representing an object category. Extensive quantitative and qualitative experimental analysis on existing and newly proposed datasets demonstrates that our approach achieves convincing recognition results and is capable of estimating object shape from just a single image!
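As a generic illustration of the voting mechanism underlying this kind of approach (not the probabilistic formulation from the paper), the sketch below lets each matched patch cast a depth-scaled vote for the object center in an accumulator; the record fields, focal length, and weights are all assumptions.

    import numpy as np

    def hough_vote_center(patches, image_shape, focal_length=500.0):
        """patches: list of dicts with keys 'xy' (patch location), 'offset_3d' (metric offset
        to the object center learned for the matched codeword), 'depth', and 'weight'."""
        accumulator = np.zeros(image_shape)
        for p in patches:
            # A metric offset projects to fewer pixels when the patch is farther away.
            dx, dy = np.asarray(p['offset_3d'][:2]) * focal_length / p['depth']
            cx, cy = int(round(p['xy'][0] + dx)), int(round(p['xy'][1] + dy))
            if 0 <= cy < image_shape[0] and 0 <= cx < image_shape[1]:
                accumulator[cy, cx] += p['weight']           # accumulate the weighted vote
        peak = np.unravel_index(np.argmax(accumulator), accumulator.shape)
        return peak, accumulator                             # strongest hypothesis + vote map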
Check out our ECCV 2010 paper and poster, and our DIM-PVT 2011 oral paper. Our dataset (released Sep 13, 2010) is now available.
In collaboration with Min Sun, Gary Bradski, and Bing-Xin Xu. This research is sponsored by the Giga Scale Research Center and NSF grant #0931474.

Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories
We propose a new 3D object class model that is capable of recognizing unseen views by pose estimation and synthesis. We achieve this by using a dense, multiview representation of the viewing sphere parameterized by a triangular mesh of viewpoints. Each triangle of viewpoints can be morphed to synthesize new viewpoints. By incorporating 3D geometrical constraints, our model establishes explicit correspondences among object parts across viewpoints. Click here for our ICCV 09 oral; or here for the CVPR 09 version.
Details on our early work on multi-view object representation can be found in our ICCV 07 paper, or our ECCV 08 paper. Click here for our dataset on multi-view object categories. In collaboration with Min Sun, Hao Su, and Fei-Fei Li.

Recognizing the 3D Pose of Object Categories by Manifold Modeling
We propose a novel statistical manifold modeling approach that is capable of classifying poses of object categories from video sequences by simultaneously minimizing the intra-class variability and maximizing inter-pose distance. We model an object category as a collection of non-parametric probability density functions (PDFs) capturing appearance and geometrical changes. We show that the problem of simultaneous intra-class variability minimization and inter-pose distance maximization is equivalent to the one of aligning and expanding the manifolds generated by these non-parametric PDFs, respectively.
Click here for our ICCV 2011 work and here for an earlier version (BMVC 09). In collaboration with Liang Mei, Jingen Liu and Alfred Hero; Sponsored by ARO and NSF grant #0931474.

Capturing spatial-temporal relationships for object and action categorization
In this work we present a novel framework for learning the typical spatial-temporal relationship between object parts. These relationships are incorporated in a flexible model for object and action categorization.
Click here for the original CVPR 06 paper or here for the WMVC08 work on action categorization. In collaboration with Andrey Del Pozo, Juan Carlos Niebles, Li Fei-Fei, John Winn and Antonio Criminisi.

Activity Recognition

Jackrabbot
In this project, we are developing a demonstration platform for making deliveries locally within the Stanford campus. The Stanford “Jackrabbot”, which takes its name from the nimble yet shy jackrabbit, is a self-navigating automated electric delivery cart capable of carrying small payloads. In contrast to autonomous cars, which operate on streets and highways, the Jackrabbot is designed to operate in pedestrian spaces, at a maximum speed of five miles per hour. When people are encountered, the Jackrabbot steps aside and crouches down to minimize its impact on human activities. Its progress and status can be monitored remotely via GPS, video, and audio.
project page

Discovering Groups of People in Images
Understanding group activities from images is an important yet challenging task. This is because there is an exponentially large number of semantic and geometrical relationships among individuals that one must model in order to effectively recognize and localize group activities. Rather than focusing on directly recognizing group activities as most previous works do, we advocate the importance of introducing an intermediate representation for modeling groups of humans, which we call structured groups. We contribute a method for identifying and localizing these structured groups in a single image despite their varying viewpoints, number of participants, and occlusions. We also contribute an extremely challenging new dataset that contains images each showing multiple people performing multiple activities. Extensive evaluation confirms our theoretical findings. project page

A Unified Framework for Multi-Target Tracking and Collective Activity Recognition
In this project we present a novel coherent, discriminative framework for simultaneously tracking multiple people and estimating their collective activities. Instead of treating these two problems separately, our model is grounded in the intuition that a strong correlation exists between a person's motion, their activity, and the motion and activities of other nearby people. We introduce a hierarchy of activity types that create a natural progression that leads from a specific person’s motion to the activity of the group as a whole. Experimental results on challenging video datasets demonstrate our theoretical claims and indicate that our model achieves the best collective activity classification results to date.
Check out our ECCV 2012 paper (selected as an oral presentation) and the project page. This project is in collaboration with Wongun Choi and is sponsored by ONR.

Collective Activities Recognition
We present a framework for the recognition of collective human activities. A collective activity is defined or reinforced by the existence of coherent behavior of individuals in time and space. We call such coherent behavior crowd context. Examples of collective activities are "queuing in a line" or "talking". We propose to recognize collective activities using the crowd context and introduce a new scheme for learning it automatically. Our scheme is constructed upon a Random Forest structure which randomly samples variable spatio-temporal regions to pick the most discriminating characteristics for classification.
Check out our CVPR 2011 paper and poster, and our VSWS09 paper. Our dataset on collective activity classification can be found here. In collaboration with Wongun Choi and Khuram Shahid. This research is sponsored by ONR and NSF EAGER.

Human Actions Recognition by Attributes
We explore the idea of using high-level semantic concepts (attributes) to represent human actions from videos and argue that action attributes enable the construction of more descriptive models for human action recognition. We propose a unified framework wherein manually specified attributes are: i) selected in a discriminative fashion so as to account for intra-class variability; ii) integrated coherently with data-driven attributes to make the attribute set more descriptive.
Click here for our CVPR 2011 oral paper. In collaboration with Jingen Liu and Ben Kuipers. This research is sponsored by NSF #0931474.

Cross-View Actions Recognition via View Knowledge Transfer
We present a novel approach for recognizing human actions from different views by establishing connections between view-dependent features across views (knowledge transfer). We discover these connections by using a bipartite graph to model view-dependent vocabularies of codewords. We then apply bipartite graph partitioning to co-cluster the vocabularies into visual-word clusters called bilingual-words (i.e., high-level features), which are capable of bridging the semantic gap between view-dependent vocabularies.
Click here for our CVPR 2011 oral paper. In collaboration with Jingen Liu, Mubarak Shah and Ben Kuipers. This research is sponsored by NSF #0931474.

Capturing spatial-temporal relationships for object and action categorization
In this work we present a novel framework for learning the typical spatial-temporal relationship between object parts. These relationships are incorporated in a flexible model for object and action categorization.
Click here for the original CVPR 06 paper or here for the WMVC08 work on action categorization. In collaboration with Andrey Del Pozo, Juan Carlos Niebles, Li Fei-Fei, John Winn and Antonio Criminisi.

Tracking

Learning to Track: Online Multi-Object Tracking by Decision Making
Online Multi-Object Tracking (MOT) has wide applications in time-critical video analysis scenarios, such as robot navigation and autonomous driving. In tracking-by-detection, a major challenge of online MOT is how to robustly associate noisy object detections on a new video frame with previously tracked objects. In this work, we formulate the online MOT problem as decision making in Markov Decision Processes (MDPs), where the lifetime of an object is modeled with an MDP. Learning a similarity function for data association is equivalent to learning a policy for the MDP, and the policy learning is approached in a reinforcement learning fashion that benefits from the advantages of both offline and online learning for data association. Moreover, our framework can naturally handle the birth/death and appearance/disappearance of targets by treating them as state transitions in the MDP while leveraging existing online single object tracking methods.
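A skeletal sketch of one target's lifetime as a small MDP is shown below; the four states are a plausible reading of the birth/death and appearance/disappearance transitions mentioned above, and the hand-written thresholds merely stand in for the learned policies.

    from enum import Enum, auto

    class State(Enum):
        ACTIVE = auto()      # a fresh, unconfirmed detection
        TRACKED = auto()     # currently being tracked
        LOST = auto()        # temporarily out of view or occluded
        INACTIVE = auto()    # terminated

    def next_state(state, detection_score, similarity, frames_lost, max_lost=30):
        """Decide the next state of a single target for the current frame."""
        if state is State.ACTIVE:        # accept the detection or reject it as a false alarm
            return State.TRACKED if detection_score > 0.5 else State.INACTIVE
        if state is State.TRACKED:       # keep tracking while the target is still supported
            return State.TRACKED if detection_score > 0.0 else State.LOST
        if state is State.LOST:          # try to re-associate the target with a new detection
            if similarity > 0.5:
                return State.TRACKED
            return State.INACTIVE if frames_lost > max_lost else State.LOST
        return State.INACTIVE            # terminal: the target has left the scene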
Check out our ICCV 2015 paper and the project page. The project is sponsored by DARPA UPSIDE grant A13-0895-S002.

Combining 3D Shape, Color, and Motion for Robust Anytime Tracking
Although object tracking has been studied for decades, real-time tracking algorithms often suffer from low accuracy and poor robustness when confronted with difficult, real-world data. We present a tracker that combines 3D shape, color (when available), and motion cues to accurately track moving objects in real-time. Our tracker allocates computational effort based on the shape of the posterior distribution. Starting with a coarse approximation to the posterior, the tracker successively refines this distribution, increasing in tracking accuracy over time. The tracker can thus be run for any amount of time, after which the current approximation to the posterior is returned. Even at a minimum runtime of 0.7 milliseconds, our method outperforms all of the baseline methods of similar speed by at least 10%. If our tracker is allowed to run for longer, the accuracy continues to improve, and it continues to outperform all baseline methods. Our tracker is thus anytime, allowing the speed or accuracy to be optimized based on the needs of the application. project page
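The sketch below illustrates the anytime, coarse-to-fine idea in one dimension: keep a coarse set of candidate alignments, repeatedly subdivide the currently best cell, and return the best estimate found when the time budget expires. The scoring function, cell layout, and budget are placeholder assumptions rather than the tracker's actual posterior computation.

    import time

    def anytime_align(score_fn, search_radius=2.0, budget_s=0.001):
        # Three coarse cells covering [-search_radius, search_radius]: (score, center, half_width).
        cells = [(score_fn(c), c, search_radius / 3)
                 for c in (-2 * search_radius / 3, 0.0, 2 * search_radius / 3)]
        deadline = time.perf_counter() + budget_s
        while time.perf_counter() < deadline:
            cells.sort(reverse=True)                    # best-scoring cell first
            _, center, half = cells.pop(0)              # refine it into two finer children
            for child in (center - half / 2, center + half / 2):
                cells.append((score_fn(child), child, half / 2))
        return max(cells)[1]                            # best alignment found so far

    # Example: a longer budget recovers a 1D offset (true value 0.7) more precisely.
    best = anytime_align(lambda x: -(x - 0.7) ** 2, budget_s=0.0005)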

Monocular Multiview Object Tracking with 3D Aspect Parts
In this work, we focus on the problem of tracking objects under significant viewpoint variations, which poses a big challenge to traditional object tracking methods. We propose a novel method to track an object and estimate its continuous pose and part locations under severe viewpoint change. In order to handle the change in topological appearance introduced by viewpoint transformations, we represent objects with 3D aspect parts and model the relationship between viewpoint and 3D aspect parts in a part-based particle filtering framework. Moreover, we show that instance-level online-learned part appearance can be incorporated into our model, which makes it more robust in difficult scenarios with occlusions.
Check out our ECCV 2014 paper and the project page. The project is sponsored by DARPA UPSIDE grant A13-0895-S002 and NSF CAREER grant N.1054127.

A Unified Framework for Multi-Target Tracking and Collective Activity Recognition
In this project we present a novel coherent, discriminative framework for simultaneously tracking multiple people and estimating their collective activities. Instead of treating these two problems separately, our model is grounded in the intuition that a strong correlation exists between a person's motion, their activity, and the motion and activities of other nearby people. We introduce a hierarchy of activity types that create a natural progression that leads from a specific person’s motion to the activity of the group as a whole. Experimental results on challenging video datasets demonstrate our theoretical claims and indicate that our model achieves the best collective activity classification results to date.
Check out our ECCV 2012 paper (selected as an oral presentation) and the project page. This project is in collaboration with Wongun Choi and is sponsored by ONR.

Multi-target tracking from a Single Moving Camera
We propose a coherent tracking framework which is capable of tracking multiple targets under unknown monocular camera motion. Our framework also models the interaction between targets, which further helps disambiguate the correspondence between targets and detections. To efficiently solve this complex inference problem, an MCMC particle filtering algorithm is incorporated. Experimental results show our algorithm is superior or comparable to state-of-the-art multi-target tracking systems even though our algorithm only uses a single un-calibrated camera. We also tested our algorithm on a moving robotic platform equipped with a depth sensor and demonstrated highly accurate tracking capabilities.
Check out our ECCV 2010 paper and poster, our WCORP-ICCV 2012 paper. In collaboration with Wongun Choi and Caroline Pantofaru (Willow Garage). This research is sponsored by NSF-EAGER, TOYOTA and ONR.

Scene and Object Reconstruction

Dense Object Reconstruction with Semantic Priors
In this project we propose a new technique for dense 3D object reconstruction from multiple views. Our method overcomes the drawbacks of traditional multi-view stereo by incorporating semantic information in the form of learned category-level shape priors and object detection. Extensive qualitative and quantitative evaluations show that our framework can produce more accurate reconstructions than alternative state-of-the-art 3D reconstruction systems.
Check out our CVPR 2013 paper and the project page. The project is sponsored by NSF CAREER grant 1054127.

Semantic Structure from Motion (SSFM)
We propose a new framework for jointly recognizing objects as well as reconstructing the underlying 3D geometry of the scene (cameras, points and objects). In our SSFM framework we leverage the intuition that measurements of keypoints and objects must be semantically and geometrically consistent across viewpoints. Our SSFM framework has the unique ability to: i) estimate camera poses from object detections only; ii) enhance camera pose estimation, compared to feature-point-based SFM algorithms; iii) improve object detections given multiple uncalibrated images, compared to independently detecting objects in single images.
Check out our CVPR 2012 paper and poster, our CVPR 2011 paper and poster, our book chapter, and our CORP-ICCV 2011 paper. Our CORP-ICCV 2011 paper is the winner of the best student paper award. A newly proposed Microsoft KINECT dataset for evaluation can be found here. In collaboration with Sid Ying-ze Bao. This research is sponsored by the Giga Scale Research Center and NSF CAREER #1054127.

Monitoring with D4AR (4 Dimensional Augmented Reality) Models
In this research, construction progress deviations between as-planned and as-built construction are measured through superimposition of the as-planned model onto site photographs for different time stamps. Our approach is based on sparse 3D reconstruction and recognition of as-built scene elements using state-of-the-art machine learning methodologies.
Click here for our recent WCVRS-ICCV 2012 paper. For an earlier version please refer to our ITCON journal paper. This line of work won the best CRC poster award at the Construction Research Congress (Seattle 2009), the Best Student Paper Award at the 6th Int. Conference on Innovation in AEC in 2010, and the 2012 Best Paper Award from the Journal of Construction Engineering and Management. In collaboration with Mani Golparvar-Fard and Feniosky Pena-Mora; Sponsored by NSF grant #0800500 and KLA-Tencor.

Depth-Encoded Hough Voting for Joint Object Detection and Shape Recovery
In this project we aim at simultaneously detecting objects, estimating their pose and recovering their 3D shape. We propose a new method called DEHV - Depth-Encoded Hough Voting. DEHV is a probabilistic Hough voting scheme which incorporates depth information into the process of learning distributions of image features (patches) representing an object category. Extensive quantitative and qualitative experimental analysis on existing and newly proposed datasets demonstrates that our approach achieves convincing recognition results and is capable of estimating object shape from just a single image!
Check out our ECCV 2010 paper and poster, and our DIM-PVT 2011 oral paper. Our dataset (released Sep 13, 2010) is now available.
In collaboration with Min Sun, Gary Bradski, and Bing-Xin Xu. This research is sponsored by the Giga Scale Research Center and NSF grant #0931474.

Learning a dense multi-view representation for detection, viewpoint classification and synthesis of object categories
We propose a new 3D object class model that is capable of recognizing unseen views by pose estimation and synthesis. We achieve this by using a dense, multiview representation of the viewing sphere parameterized by a triangular mesh of viewpoints. Each triangle of viewpoints can be morphed to synthesize new viewpoints. By incorporating 3D geometrical constraints, our model establishes explicit correspondences among object parts across viewpoints. Click here for our ICCV 09 oral; or here for the CVPR 09 version.
Details on our early work on multi-view object representation can be found in our ICCV 07 paper, or our ECCV 08 paper. Click here for our dataset on multi-view object categories. In collaboration with Min Sun, Hao Su, and Fei-Fei Li.

Human Pose Estimation

Efficient and Exact Branch-and-Bound
We propose a novel Branch-and-Bound (BB) method to efficiently solve exact MAP-MRF inference on problems with a large number of states (per variable). In our work, we evaluate different variants of our proposed BB algorithm and a state-of-the-art exact inference algorithm on synthetic data, human pose estimation from both a single image and a video sequence, and protein design problems.
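As an illustration of the search scheme only (not the bounds developed in the paper), the sketch below performs exact MAP inference on a tiny pairwise MRF by depth-first branch-and-bound, extending partial assignments variable by variable and pruning whenever a simple optimistic bound cannot beat the best complete assignment found so far.

    import numpy as np

    def bb_map(unary, pairwise):
        """unary: (n, K) scores to maximize; pairwise: {(i, j): (K, K) scores} with i < j."""
        n, K = unary.shape
        best = {'score': -np.inf, 'assign': None}

        def bound(assign):
            m = len(assign)
            score = sum(unary[i, s] for i, s in enumerate(assign))
            for (i, j), pw in pairwise.items():
                if j < m:                                  # both endpoints assigned: exact term
                    score += pw[assign[i], assign[j]]
                elif i < m:                                # one endpoint assigned: optimistic max
                    score += pw[assign[i]].max()
                else:                                      # neither assigned: global max
                    score += pw.max()
            score += sum(unary[i].max() for i in range(m, n))  # optimistic unaries for the rest
            return score

        def branch(assign):
            if len(assign) == n:
                s = bound(assign)                          # bound() is exact for a complete assignment
                if s > best['score']:
                    best['score'], best['assign'] = s, list(assign)
                return
            for state in range(K):
                child = assign + [state]
                if bound(child) > best['score']:           # prune subtrees that cannot improve
                    branch(child)

        branch([])
        return best['assign'], best['score']

    # Example: two variables with three states each and a single edge between them.
    # assign, score = bb_map(np.random.rand(2, 3), {(0, 1): np.random.rand(3, 3)})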
Check out our AISTATS 2012 paper, our CVPR 2012 paper and poster. Source Code of the BB algorithm can be downloaded from here. This research is in collaboration with Prof. Honglak Lee and Prof. Silvio Savarese. It is sponsored by the ONR grant N000141110389, ARO grant W911NF-09-1-0310, and the Google Faculty Research Award.

Joint Detection and Pose Estimation of Articulated Objects
In this project, we propose a new model called the Articulated Part-based Model (APM) for jointly detecting objects (e.g. humans) and estimating their poses (e.g. the configuration of body parts such as arms, torso, and head). APM recursively represents an object as a collection of parts at multiple levels of detail, from coarse to fine, where parts at every level are connected to a coarser level through a parent-child relationship. Extensive quantitative and qualitative experimental results on public datasets show that APM outperforms state-of-the-art methods.
In collaboration with Min Sun. Click here for our project webpage and the dataset. Check out our ICCV 2011 paper and poster. This research is sponsored by ARO.

Material Understanding and Reconstruction

Perception of Reflective materials
How well can the human visual system see the shape of a reflective object? How does the brain identify a specular reflection as such and not as a piece of texture attached to the surface? Psychophysical analysis sheds light on these intriguing questions.
Click here for details. In collaboration with Andrey Del Pozo and Dan Simons; Sponsored by NSF grant #0413312.

Identification and recognition of reflective surfaces
Recognizing and localizing specular (or mirror-like) surfaces from a single image is a great challenge to computer vision. Unlike other materials, the appearance of a specular surface changes as a function of the surrounding environment as well as the position of the observer. Even though the reflection on a specular surface has an intrinsic ambiguity that might be resolved by high-level reasoning, we have demonstrated that we can take advantage of low-level features to successfully recognize specular surfaces in real-world scenes.
Click here for our CVPR 07 paper. In collaboration with Andrey DelPozo; Sponsored by NSF grant #0413312.

Analysis of paintings in Renaissance art
We study geometrical properties of the mirror in Hans Memling’s 1487 diptych Virgin and Child and Maarten van Nieuwenhove. We aim to discover whether the mirror was part of the painting’s initial design or was instead added later by the artist.
Click here for the article, and here for the video. In collaboration with Ron Spronk, David Stork, and Andrey DelPozo; Sponsored by NSF grant #0413312.
