Aditya Kusupati

CV | Scholar | dblp
Github | Twitter | Calendar

I am a CS PhD student at the University of Washington, jointly advised by Ali Farhadi and Sham Kakade, while working closely with Prateek Jain at Google Research as a Student Researcher. This summer, I am visiting Alyosha Efros at BAIR. My research interests lie at the intersection of Machine Learning and Computer Vision, with a focus on real-world deployability.

Before starting my PhD, I spent two amazing years as a Research Fellow at Microsoft Research India with Manik Varma and Prateek Jain. In a past life, I earned a Bachelor's in CS with Honours and a Minor in EE from IIT Bombay, where I had the pleasure of working with Soumen Chakrabarti.

Pro Bono: I have set aside 1 hr every week to help people and organizations that might benefit from my insights in using ML and CS to solve problems with societal impact. PhD applicants from underrepresented communities can also use this time to get feedback from me on their applications. Contact me via email to set up a slot.

* - equal contribution

FLUID: A Unified Evaluation Framework for Flexible Sequential Data
Matthew Wallingford, Aditya Kusupati, Keivan Alizadeh-Vahid, Aaron Walsman, Aniruddha Kembhavi and Ali Farhadi
Under Review, 2022

abstract / bibtex / pdf / arXiv / code / project page

Enabling robust intelligence in the real world entails systems that offer continuous inference while learning from varying amounts of data and supervision. The machine learning community has organically broken down this challenging goal into manageable sub-tasks such as supervised, few-shot, and continual learning. In light of substantial progress on each sub-task, we pose the question, “How well does this progress translate to more practical scenarios?” To investigate this question, we construct a new framework, FLUID, which removes certain assumptions made by current experimental setups while integrating these sub-tasks via the following design choices -- consuming sequential data, allowing for flexible training phases, being compute aware, and working in an open-world setting. Evaluating a broad set of methods on FLUID leads to new insights including strong evidence that methods are overfitting to their experimental setup. For example, we find that representative few-shot methods are substantially worse than simple baselines, self-supervised representations from MoCo fail to learn new classes when the downstream task contains a mix of new and old classes, and pretraining largely mitigates the problem of catastrophic forgetting. Finally, we propose two new simple methods that outperform all other evaluated methods, which further calls into question our progress towards robust, real-world systems.

  author    = {Wallingford, Matthew and Kusupati, Aditya 
    and Alizadeh-Vahid, Keivan and Walsman, Aaron and 
    Kembhavi, Aniruddha and Farhadi, Ali},
  title     = {FLUID: A Unified Evaluation Framework 
    for Flexible Sequential Data.},
  booktitle = {arXiv preprint arXiv:2007.02519},
  year      = {2020},

Conference Publications

[CVPR'22] MERLOT RESERVE: Neural Script Knowledge through Vision and Language and Sound
Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi and Yejin Choi.
Conference on Computer Vision and Pattern Recognition (CVPR), 2022

abstract / bibtex / pdf / arXiv / code / project page

As humans, we navigate the world through all our senses, using perceptual input from each one to correct the others. We introduce MERLOT Reserve, a model that represents videos jointly over time -- through a new training objective that learns from audio, subtitles, and video frames. Given a video, we replace snippets of text and audio with a MASK token; the model learns by choosing the correct masked-out snippet. Our objective learns faster than alternatives, and performs well at scale: we pretrain on 20 million YouTube videos. Empirical results show that MERLOT Reserve learns strong representations about videos through all constituent modalities. When finetuned, it sets a new state-of-the-art on both VCR and TVQA, outperforming prior work by 5% and 7% respectively. Ablations show that both tasks benefit from audio pretraining -- even VCR, a QA task centered around images (without sound). Moreover, our objective enables out-of-the-box prediction, revealing strong multimodal commonsense understanding. In a fully zero-shot setting, our model obtains competitive results on four video understanding tasks, even outperforming supervised approaches on the recently proposed Situated Reasoning (STAR) benchmark. We analyze why incorporating audio leads to better vision-language representations, suggesting significant opportunities for future research. We conclude by discussing ethical and societal implications of multimodal pretraining.

    title={MERLOT Reserve: Multimodal Neural Script Knowledge through Vision and Language and Sound},
    author={Zellers, Rowan and Lu, Jiasen and Lu, Ximing and Yu, Youngjae and Zhao, Yanpeng and Salehi, Mohammadreza and Kusupati, Aditya and Hessel, Jack and Farhadi, Ali and Choi, Yejin},

[CHI'22] ProtoSound: Personalized, Scalable Sound Recognition for d/Deaf and Hard of Hearing Users through In-the-Wild Few-Shot Interactions
Dhruv Jain, Khoa Nguyen, Steven Goodman, Rachel Grossman-Kahn, Hung Ngo, Aditya Kusupati, Ruofei Du, Alex Olwal, Leah Findlater and Jon Froehlich
Conference on Human Factors in Computing Systems (CHI), 2022

abstract / bibtex / pdf / reviews / arXiv / code / video

Recent advances have enabled automatic sound recognition systems for deaf and hard of hearing (DHH) users on mobile devices. However, these tools use pre-trained, generic sound recognition models, which do not meet the diverse needs of DHH users. We introduce ProtoSound, an interactive system for customizing sound recognition models by recording a few examples, thereby enabling personalized and fine-grained categories. ProtoSound is motivated by prior work examining sound awareness needs of DHH people and by a survey we conducted with 472 DHH participants. To evaluate ProtoSound, we characterized performance on two real-world sound datasets, showing significant improvement over state-of-the-art (e.g., +9.7% accuracy on the first dataset). We then deployed ProtoSound's end-user training and real-time recognition through a mobile application and recruited 19 hearing participants who listened to the real-world sounds and rated the accuracy across 56 locations (e.g., homes, restaurants, parks). Results show that ProtoSound personalized the model on-device in real-time and accurately learned sounds across diverse acoustic contexts. We close by discussing open challenges in personalizable sound recognition, including the need for better recording interfaces and algorithmic improvements.


[NeurIPS'21] LLC: Accurate, Multi-purpose Learnt Low-dimensional Binary Codes
Aditya Kusupati, Matthew Wallingford, Vivek Ramanujan, Raghav Somani, Jae Sung Park, Krishna Pillutla, Prateek Jain, Sham Kakade and Ali Farhadi
Neural Information Processing Systems (NeurIPS), 2021

abstract / bibtex / pdf / reviews / arXiv / code / video / poster

Learning binary representations of instances and classes is a classical problem with several high potential applications. In modern settings, the compression of high-dimensional neural representations to low-dimensional binary codes is a challenging task and often requires large bit-codes to be accurate. In this work, we propose a novel method for Learning Low-dimensional binary Codes (LLC) for instances as well as classes. Our method does not require any side-information, like annotated attributes or label meta-data, and learns extremely low-dimensional binary codes (~20 bits for ImageNet-1K). The learnt codes are super-efficient while still ensuring nearly optimal classification accuracy for ResNet50 on ImageNet-1K. We demonstrate that the learnt codes capture intrinsically important features in the data, by discovering an intuitive taxonomy over classes. We further quantitatively measure the quality of our codes by applying them to efficient image retrieval as well as out-of-distribution (OOD) detection problems. For the ImageNet-100 retrieval problem, our learnt binary codes outperform 16 bit HashNet using only 10 bits and are also as accurate as 10 dimensional real representations. Finally, our learnt binary codes can perform OOD detection, out-of-the-box, as accurately as a baseline that needs ~3000 samples to tune its threshold, while we require none. Code and pre-trained models are available at

  author    = {Kusupati, Aditya and Wallingford, Matthew 
    and Ramanujan, Vivek and Somani, Raghav and 
    Park, Jae Sung and Pillutla, Krishna and
    Jain, Prateek and Kakade, Sham and Farhadi, Ali},
  title     = {LLC: Accurate, Multi-purpose 
    Learnt Low-dimensional Binary Codes},
  booktitle = {Advances in 
    Neural Information Processing Systems},
  month     = {December},
  year      = {2021},

[NeurIPS'20] RNNPool: Efficient Non-linear Pooling for RAM Constrained Inference
Oindrila Saha, Aditya Kusupati, Harsha Vardhan Simhadri, Manik Varma and Prateek Jain
Neural Information Processing Systems (NeurIPS), 2020

Virtual Spotlight presentation
abstract / bibtex / pdf / reviews / arXiv / code / video / poster / blog 1, 2
Also presented at the WiCV workshop @ CVPR, 2020

Standard Convolutional Neural Networks (CNNs) designed for computer vision tasks tend to have large intermediate activation maps. These require large working memory and are thus unsuitable for deployment on resource-constrained devices typically used for inference on the edge. Aggressively downsampling the images via pooling or strided convolutions can address the problem but leads to a significant decrease in accuracy due to gross aggregation of the feature map by standard pooling operators. In this paper, we introduce RNNPool, a novel pooling operator based on Recurrent Neural Networks (RNNs), that efficiently aggregates features over large patches of an image and rapidly downsamples activation maps. Empirical evaluation indicates that an RNNPool layer can effectively replace multiple blocks in a variety of architectures, such as MobileNets and DenseNet, when applied to standard vision tasks like image classification and face detection. That is, RNNPool can significantly decrease computational complexity and peak memory usage for inference while retaining comparable accuracy. We use RNNPool with the standard S3FD architecture to construct a face detection method that achieves state-of-the-art MAP for tiny ARM Cortex-M4 class microcontrollers with under 256 KB of RAM. Code is released at

  author    = {Saha, Oindrila and Kusupati, Aditya and 
    Simhadri, Harsha Vardhan and Varma, Manik and 
    Jain, Prateek},
  title     = {RNNPool: Efficient Non-linear Pooling 
    for RAM Constrained Inference},
  booktitle = {Advances in 
    Neural Information Processing Systems},
  month     = {December},
  year      = {2020},

[ICML'20] Soft Threshold Weight Reparameterization for Learnable Sparsity
Aditya Kusupati, Vivek Ramanujan*, Raghav Somani*, Mitchell Wortsman*, Prateek Jain, Sham Kakade and Ali Farhadi
International Conference on Machine Learning (ICML), 2020

Virtual Talk
abstract / bibtex / pdf / reviews / arXiv / code / video

Sparsity in Deep Neural Networks (DNNs) is studied extensively with the focus of maximizing prediction accuracy given an overall parameter budget. Existing methods rely on uniform or heuristic non-uniform sparsity budgets which have sub-optimal layer-wise parameter allocation resulting in a) lower prediction accuracy or b) higher inference cost (FLOPs). This work proposes Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on DNN weights. STR smoothly induces sparsity while learning pruning thresholds thereby obtaining a non-uniform sparsity budget. Our method achieves state-of-the-art accuracy for unstructured sparsity in CNNs (ResNet50 and MobileNetV1 on ImageNet-1K), and, additionally, learns non-uniform budgets that empirically reduce the FLOPs by up to 50%. Notably, STR boosts the accuracy over existing results by up to 10% in the ultra sparse (99%) regime and can also be used to induce low-rank (structured sparsity) in RNNs. In short, STR is a simple mechanism which learns effective sparsity budgets that contrast with popular heuristics. Code, pretrained models and sparsity budgets are at
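
As a rough illustration of the soft-threshold operator the abstract refers to, here is a minimal sketch in plain Python. It is not the released STR code: the function names and the fixed threshold `alpha` are hypothetical, and in STR the threshold is learnt during training rather than set by hand.

```python
def soft_threshold(w: float, alpha: float) -> float:
    """Shrink w toward zero by alpha; any |w| <= alpha becomes exactly 0."""
    magnitude = max(abs(w) - alpha, 0.0)
    if magnitude == 0.0:
        return 0.0
    return magnitude if w > 0 else -magnitude


def sparsify(weights, alpha):
    """Apply the soft-threshold operator to every weight of one layer."""
    return [soft_threshold(w, alpha) for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Hypothetical layer weights; alpha is fixed here only for illustration.
layer = [0.8, -0.05, 0.3, -0.6, 0.02]
pruned = sparsify(layer, alpha=0.1)
```

Small-magnitude weights are zeroed while the rest are shrunk smoothly, so a per-layer threshold translates directly into a per-layer sparsity level.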

  author    = {Kusupati, Aditya and Ramanujan, Vivek and
    Somani, Raghav and Wortsman, Mitchell and 
    Jain, Prateek and Kakade, Sham and Farhadi, Ali},
  title     = {Soft Threshold Weight Reparameterization 
    for Learnable Sparsity},
  booktitle = {Proceedings of the International 
    Conference on Machine Learning},
  month     = {July},
  year      = {2020},

[WSDM'20] Extreme Regression for Dynamic Search Advertising
Yashoteja Prabhu, Aditya Kusupati, Nilesh Gupta and Manik Varma
International Conference on Web Search and Data Mining (WSDM), 2020

Long Oral presentation
abstract / bibtex / pdf / reviews / arXiv / code / poster / XML Repository
Also presented at the Workshop on eXtreme Classification: Theory and Applications @ ICML, 2020

This paper introduces a new learning paradigm called eXtreme Regression (XR) whose objective is to accurately predict the numerical degrees of relevance of an extremely large number of labels to a data point. XR can provide elegant solutions to many large-scale ranking and recommendation applications including Dynamic Search Advertising (DSA). XR can learn more accurate models than the recently popular extreme classifiers which incorrectly assume strictly binary-valued label relevances. Traditional regression metrics which sum the errors over all the labels are unsuitable for XR problems since they could give extremely loose bounds for the label ranking quality. Also, the existing regression algorithms won't efficiently scale to millions of labels. This paper addresses these limitations through: (1) new evaluation metrics for XR which sum only the k largest regression errors; (2) a new algorithm called XReg which decomposes the XR task into a hierarchy of much smaller regression problems, thus leading to highly efficient training and prediction. This paper also introduces (3) a new labelwise prediction algorithm in XReg useful for DSA and other recommendation tasks.
Experiments on benchmark datasets demonstrated that XReg can outperform the state-of-the-art extreme classifiers as well as large-scale regressors and rankers by up to 50% reduction in the new XR error metric, and up to 2% and 2.4% improvements in terms of the propensity-scored precision metric used in extreme classification and the click-through rate metric used in DSA respectively. Deployment of XReg on DSA in Bing resulted in a relative gain of 58% in revenue and 27% in query coverage. XReg's source code can be downloaded from
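
To make the idea of a metric that "sums only the k largest regression errors" concrete, here is a minimal sketch. It assumes absolute error per label and no normalization; the function name `top_k_error` and the toy relevance values are hypothetical, and the exact metric definitions in the paper may differ.

```python
import heapq


def top_k_error(true_relevance, predicted_relevance, k):
    """Sum of the k largest absolute per-label errors for one data point."""
    errors = [abs(t - p) for t, p in zip(true_relevance, predicted_relevance)]
    return sum(heapq.nlargest(k, errors))


# Toy example with four labels (values are made up for illustration).
true_rel = [1.0, 0.0, 0.5, 0.0]
pred_rel = [0.5, 0.0, 0.5, 0.25]

worst_two = top_k_error(true_rel, pred_rel, k=2)  # 0.5 + 0.25
worst_one = top_k_error(true_rel, pred_rel, k=1)  # 0.5
```

Because only the worst k labels contribute, the many labels that are correctly predicted as irrelevant do not dilute the score, unlike a sum over all labels.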

  author    = {Prabhu, Yashoteja and Kusupati, Aditya and 
    Gupta, Nilesh and Varma, Manik},
  title     = {Extreme Regression for Dynamic 
    Search Advertising},
  booktitle = {Proceedings of the ACM International 
    Conference on Web Search and Data Mining},
  month     = {February},
  year      = {2020},

[BuildSys'19] One Size Does Not Fit All: Multi-Scale, Cascaded RNNs for Radar Classification
Dhrubojyoti Roy*, Sangeeta Srivastava*, Aditya Kusupati, Pranshu Jain, Manik Varma and Anish Arora
International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys), 2019

Oral presentation 🏆 Best Paper Runner-Up Award
abstract / bibtex / pdf / reviews / arXiv / code / poster / dataset / news
Invited Paper in ACM Transactions on Sensor Networks (TOSN), 2021

Edge sensing with micro-power pulse-Doppler radars is an emergent domain in monitoring and surveillance with several smart city applications. Existing solutions for the clutter versus multi-source radar classification task are limited in terms of either accuracy or efficiency, and in some cases, struggle with a trade-off between false alarms and recall of sources. We find that this problem can be resolved by learning the classifier across multiple time-scales. We propose a multi-scale, cascaded recurrent neural network architecture, MSC-RNN, comprised of an efficient multi-instance learning (MIL) Recurrent Neural Network (RNN) for clutter discrimination at a lower tier, and a more complex RNN classifier for source classification at the upper tier. By controlling the invocation of the upper RNN with the help of the lower tier conditionally, MSC-RNN achieves an overall accuracy of 0.972. Our approach holistically improves the accuracy and per-class recalls over machine learning models suitable for radar inferencing. Notably, we outperform cross-domain handcrafted feature engineering with purely time-domain deep feature learning, while also being up to ~3x more efficient than a competitive solution.

  author    = {Roy, Dhrubojyoti and Srivastava, Sangeeta 
    and Kusupati, Aditya and Jain, Pranshu and 
    Varma, Manik and Arora, Anish},
  title     = {One Size Does Not Fit All: 
    Multi-Scale, Cascaded RNNs for 
    Radar Classification},
  booktitle = {Proceedings of the ACM International 
    Conference on Systems for Energy-Efficient 
    Buildings, Cities, and Transportation},
  month     = {November},
  year      = {2019},

[NeurIPS'18] FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network
Aditya Kusupati, Manish Singh, Kush Bhatia, Ashish Kumar, Prateek Jain and Manik Varma
Neural Information Processing Systems (NeurIPS), 2018

abstract / bibtex / pdf / reviews / arXiv / code / video / poster / datasets / blog

This paper develops the FastRNN and FastGRNN algorithms to address the twin RNN limitations of inaccurate training and inefficient prediction. Previous approaches have improved accuracy at the expense of prediction costs making them infeasible for resource-constrained and real-time applications. Unitary RNNs have increased accuracy somewhat by restricting the range of the state transition matrix's singular values but have also increased the model size as they require a larger number of hidden units to make up for the loss in expressive power. Gated RNNs have obtained state-of-the-art accuracies by adding extra parameters thereby resulting in even larger models. FastRNN addresses these limitations by adding a residual connection that does not constrain the range of the singular values explicitly and has only two extra scalar parameters. FastGRNN then extends the residual connection to a gate by reusing the RNN matrices to match state-of-the-art gated RNN accuracies but with a 2-4x smaller model. Enforcing FastGRNN's matrices to be low-rank, sparse and quantized resulted in accurate models that could be up to 35x smaller than leading gated and unitary RNNs. This allowed FastGRNN to accurately recognize the "Hey Cortana" wakeword with a 1 KB model and to be deployed on severely resource-constrained IoT microcontrollers too tiny to store other RNN models. FastGRNN's code is available at
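
The residual connection with two extra scalars can be sketched as a single update step, h_t = alpha * tanh(W x_t + U h_{t-1} + b) + beta * h_{t-1}. The code below is a plain-Python illustration and not the EdgeML implementation: the function name, the choice of tanh, and the fixed values of `alpha` and `beta` are assumptions made for the example (in FastRNN the two scalars are trained).

```python
import math


def fastrnn_step(x, h_prev, W, U, b, alpha, beta):
    """One FastRNN-style update: a standard RNN candidate state, scaled by
    alpha, plus a residual copy of the previous state scaled by beta."""
    h_new = []
    for i in range(len(h_prev)):
        pre = b[i]
        pre += sum(W[i][j] * x[j] for j in range(len(x)))
        pre += sum(U[i][j] * h_prev[j] for j in range(len(h_prev)))
        h_new.append(alpha * math.tanh(pre) + beta * h_prev[i])
    return h_new


# Tiny hypothetical cell: 2 inputs, 2 hidden units.
W = [[0.5, -0.5], [0.25, 0.75]]
U = [[0.0, 1.0], [1.0, 0.0]]
b = [0.0, 0.0]

h = [0.0, 0.0]
for x_t in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h = fastrnn_step(x_t, h, W, U, b, alpha=0.1, beta=0.9)
```

With a small `alpha` and a `beta` close to 1, the hidden state changes slowly from step to step, which is the stabilizing effect the residual connection is meant to provide.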

  author    = {Kusupati, Aditya and Singh, Manish and 
    Bhatia, Kush and Kumar, Ashish and 
    Jain, Prateek and Varma, Manik},
  title     = {{FastGRNN}: A Fast, Accurate, 
    Stable and Tiny Kilobyte Sized 
    Gated Recurrent Neural Network.},
  booktitle = {Advances in 
    Neural Information Processing Systems},
  month     = {December},
  year      = {2018},

Workshop Publications

[AdvML@ICML'21] Disrupting Model Training with Adversarial Shortcuts
Ivan Evtimov, Ian Covert, Aditya Kusupati and Tadayoshi Kohno
Workshop on Adversarial Machine Learning @ ICML, 2021

abstract / bibtex / pdf / arXiv / code

When data is publicly released for human consumption, it is unclear how to prevent its unauthorized usage for machine learning purposes. Successful model training may be preventable with carefully designed dataset modifications, and we present a proof-of-concept approach for the image classification setting. We propose methods based on the notion of adversarial shortcuts, which encourage models to rely on non-robust signals rather than semantic features, and our experiments demonstrate that these measures successfully prevent deep learning models from achieving high accuracy on real, unmodified data examples.

  author    = {Evtimov, Ivan and Covert, Ian
    and Kusupati, Aditya and Kohno, Tadayoshi},
  title     = {Disrupting Model Training 
    with Adversarial Shortcuts},
  booktitle = {arXiv preprint arXiv:2106.06654},
  year      = {2021},


[UG Thesis'17] Efficient Spatial Representation for Entity-Typing
Anand Dhoot*, Aditya Kusupati* and Soumen Chakrabarti
Undergraduate Thesis, CSE IIT Bombay, 2016-17

abstract / bibtex / pdf

The project aims to create efficient spatial embeddings for entities and types that would be useful for various downstream tasks such as Knowledge Base Completion, Fine-Type Tagging and Question Answering.

  author = {Dhoot, Anand and Kusupati, Aditya 
    and Chakrabarti, Soumen},
  title = {Efficient Spatial Representation 
    for Entity-Typing},
  booktitle = {Undergraduate Thesis, CSE IIT Bombay},
  year = {2016-17},


EdgeML: Machine Learning for resource-constrained edge devices
Work of many amazing collaborators. I was one of the initial and primary contributors.
Github, Microsoft Research India, 2017-present.

abstract / bibtex
    author = {{Dennis, Don Kurian and Gaurkar, Yash and 
      Gopinath, Sridhar and Goyal, Sachin 
      and Gupta, Chirag and Jain, Moksh 
      and Kumar, Ashish and Kusupati, Aditya 
      and Lovett, Chris and Patil, Shishir Girish 
      and Saha, Oindrila and Simhadri, Harsha Vardhan}},
    title = {{EdgeML: Machine Learning 
      for resource-constrained edge devices}},
    url = {},
    version = {0.3},

Teaching, Service & Talks

Check my CV for these details.