I am a CS PhD student advised by Sergey Levine. I also work closely with Larry Zitnick. My research focuses on building autonomous robot controllers using deep reinforcement learning and vision. In the past, I have worked on various problems in visual understanding and recognition.
~ Sept. 2017 ~ At IROS 2017, I will be giving an invited talk at the workshop on Vision-based Agile Autonomous Navigation of UAVs.
~ Jun. 2017 ~ I am interning at Google Brain Robotics.
~ May. 2017 ~ Won the NVIDIA Graduate Fellowship 2017-2018 award!
~ Oct. 2016 ~ I will be serving as a program committee member for CVPR'17.
~ Jun. 2016 ~ Spending my summer as a research intern at Magic Leap! It feels magical to be a leaper :)
~ Apr. 2016 ~ I am a guest lecturer for the ML course (CSEP546) in the Spring'16 quarter.
~ Feb. 2016 ~ I will be serving as a program committee member for ECCV'16.
~ Feb. 2016 ~ I am a guest lecturer for the AI course (CSEP573) in the Winter'16 quarter.
~ Nov. 2015 ~ I will be serving as a program committee member for CVPR'16.
~ Oct. 2015 ~ I am a guest lecturer for the AI course (CSE473) in the Fall'15 quarter.
~ Sep. 2015 ~ I am organizing the Vision Seminar (CSE 590v) for the Fall'15 quarter.
~ Sep. 2015 ~ Our visual analogy paper was accepted to NIPS'15.
~ Sep. 2015 ~ Our visual entailment paper was accepted as an oral at ICCV'15.
~ Mar. 2015 ~ VisKE was accepted to CVPR'15.
~ Jan. 2015 ~ My album creation paper is featured in EurekAlert.
~ Dec. 2014 ~ I will organize the GRAIL seminar (CSE 591) for the Spring'15 quarter.
~ Nov. 2014 ~ I will be serving as a program committee member for CVPR'15.
~ May. 2017 ~ GPU Technology Conference 2017
~ May. 2017 ~ Symposium on Robot Learning 2017
~ Apr. 2017 ~ BAIR Seminar
We propose CAD2RL, a flight controller for Collision Avoidance via Deep Reinforcement Learning that can be used to perform collision-free flight in the real world although it is trained entirely in a 3D CAD model simulator. Our method uses only single RGB images from a monocular camera mounted on the robot as the input and is specialized for indoor hallway following and obstacle avoidance.
In this paper, we study the problem of answering visual analogy questions. These questions take the form "image A is to image B as image C is to what?" Answering them entails discovering the mapping from image A to image B, extending that mapping to image C, and searching for the image D such that the relation from A to B also holds from C to D. We pose this problem as learning an embedding that encourages pairs of analogous images with similar transformations to be close together, using convolutional neural networks with a quadruple Siamese architecture.
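The search for image D can be illustrated with a minimal sketch. It assumes that embeddings are already available (the toy 2-D vectors below stand in for CNN outputs) and that a "transformation" is the difference between two embeddings; both are illustrative assumptions, and the names `transformation`, `distance`, and `answer_analogy` are hypothetical rather than taken from the paper.

```python
import math

def transformation(src, dst):
    """Assumed stand-in for a learned transformation: the embedding difference."""
    return [d - s for s, d in zip(src, dst)]

def distance(u, v):
    """Euclidean distance between two vectors of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def answer_analogy(emb_a, emb_b, emb_c, candidates):
    """Return the candidate D whose C-to-D transformation is closest to A-to-B."""
    target = transformation(emb_a, emb_b)
    return min(
        candidates,
        key=lambda name: distance(target, transformation(emb_c, candidates[name])),
    )

# Toy embeddings (hypothetical values, not outputs of a real network).
emb_a, emb_b, emb_c = [0.0, 0.0], [1.0, 0.0], [0.0, 2.0]
candidates = {"d1": [1.0, 2.0], "d2": [0.0, 3.0], "d3": [-1.0, 2.0]}

answer = answer_analogy(emb_a, emb_b, emb_c, candidates)
print(answer)
```

In this toy setup, the candidate whose offset from C matches the offset from A to B is selected; a trained embedding would be needed for the same rule to work on real images.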
We introduce Segment-Phrase Table (SPT), a large collection of bijective associations between textual phrases and their corresponding segmentations. We show that fine-grained textual labels facilitate contextual reasoning that helps in satisfying semantic constraints across image segments. This feature enables us to achieve state-of-the-art segmentation results on benchmark datasets. We also show that associating high-quality segmentations with textual phrases aids richer semantic understanding of, and reasoning about, these phrases, which motivates the problems of visual entailment and visual paraphrasing.
In this work, we introduce the problem of visual verification of relation phrases and develop a Visual Knowledge Extraction system called VisKE. Given a verb-based relation phrase between common nouns, our approach assesses its validity by jointly analyzing text and images and reasoning about the spatial consistency of the relative configurations of the entities and the relation involved. Our approach involves no explicit human supervision, thereby enabling large-scale analysis. Using our approach, we have already verified over 12000 relation phrases. Our approach has been used not only to enrich existing textual knowledge bases by improving their recall, but also to augment open-domain question-answer reasoning.
In this paper, we propose a method to learn scene structures that can encode three main interlacing components of a scene: the scene category, the context-specific appearance of objects, and their layout. Our experimental evaluations show that our learned scene structures outperform the state-of-the-art Deformable Part Models method in detecting objects in a scene. Our scene structure provides a level of scene understanding that is amenable to deep visual inferences. The scene structures can also generate features that can later be used for scene categorization. Using these features, we also show promising results on scene categorization.
We propose the problem of automatic photo album creation from an unordered image collection. To help solve this problem, we collect a new benchmark dataset based on Flickr images. We analyze the problem and provide experimental evidence, through user studies, that both the selection and the ordering of photos within an album are important for human observers. To capture and learn rules of album composition, we propose a discriminative structured model capable of encoding simple preferences for contextual layout of the scene and ordering between photos. The parameters of the model are learned using a structured SVM framework.
Despite the attractiveness and simplicity of producing word clouds, they do not provide a thorough visualization of the distribution of the underlying data. Our proposed method decodes an input word cloud visualization and provides the raw data in the form of a list of (word, value) pairs. To the best of our knowledge, our work is the first attempt to extract raw data from a word cloud visualization. The results of our experiments show that our algorithm is able to extract the words and their weights effectively, with a considerably low error rate.
In this paper, we propose a simple but efficient image representation for solving the scene classification problem. Our new representation combines the benefits of spatial pyramid representation using nonlinear feature coding and latent Support Vector Machine (LSVM) to train a set of Latent Pyramidal Regions (LPR). Each of our LPRs captures a discriminative characteristic of the scenes and is trained by searching over all possible sub-windows of the images in a latent SVM training procedure. The final responses of the LPRs form a single feature vector, which we call the LPR representation, that can be used for the classification task.
Large-scale recognition problems with thousands of classes pose a particular challenge because applying the classifier requires more computation as the number of classes grows. The label tree model integrates classification with the traversal of the tree so that complexity grows logarithmically in the number of classes. We show how the parameters of the label tree can be found using maximum likelihood estimation. This new probabilistic learning technique produces a label tree with significantly improved recognition accuracy.
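To make the logarithmic cost concrete, here is a minimal sketch of test-time traversal in a label tree. It assumes a balanced binary tree with one linear scorer per node and uses random placeholder weights rather than parameters learned by maximum likelihood; the names `Node`, `build`, and `classify` are hypothetical and only illustrate the idea.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class Node:
    weights: List[float]                 # placeholder linear scorer for this node
    label: Optional[str] = None          # set only on leaves
    children: List["Node"] = field(default_factory=list)

def build(labels, dim, rng):
    """Build a balanced binary tree over the labels with random (untrained) weights."""
    weights = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    if len(labels) == 1:
        return Node(weights, label=labels[0])
    mid = len(labels) // 2
    return Node(weights, children=[build(labels[:mid], dim, rng),
                                   build(labels[mid:], dim, rng)])

def score(node, x):
    """Dot product between the node's weights and the feature vector."""
    return sum(w * xi for w, xi in zip(node.weights, x))

def classify(root, x):
    """Follow the best-scoring child at each level; count scorer evaluations."""
    node, evaluations = root, 0
    while node.children:
        evaluations += len(node.children)
        node = max(node.children, key=lambda child: score(child, x))
    return node.label, evaluations

rng = random.Random(0)
labels = [f"class_{i}" for i in range(1024)]
tree = build(labels, dim=8, rng=rng)
x = [rng.uniform(-1.0, 1.0) for _ in range(8)]

predicted, evaluations = classify(tree, x)
print(predicted, evaluations)
```

With 1024 classes, the traversal evaluates two scorers at each of ten levels, so 20 evaluations in total, whereas a flat one-vs-rest classifier would evaluate 1024. Because the weights here are random, the predicted label itself is meaningless; only the evaluation count is the point of the sketch.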