Aditya Kusupati

CV | Scholar | dblp
Github | Twitter | Calendar

I am a CS PhD student at the University of Washington, jointly advised by Ali Farhadi and Sham Kakade. My research interests lie at the intersection of Machine Learning, Computer Vision & Robotics (Multimodal Perception, shh! it is a secret). I also interned at NVIDIA Toronto Lab with Sanja Fidler and Antonio Torralba.

Before starting my PhD, I spent two amazing years as a Research Fellow at Microsoft Research India with Manik Varma and Prateek Jain. In a past life, I earned a Bachelor's in CS with Honours and a Minor in EE from IIT Bombay, where I had the pleasure of working with Soumen Chakrabarti.

Pro Bono: I have set aside one hour every week to help people and organizations that might benefit from my insights in using ML and CS to solve problems with societal impact. PhD applicants from underrepresented communities can also use this time to get feedback from me on their applications. Contact me via email to set up a slot.

* - equal contribution

[NEW] Are We Overfitting to Experimental Setups in Recognition?
Matthew Wallingford, Aditya Kusupati*, Keivan Alizadeh-Vahid*,
Aaron Walsman, Aniruddha Kembhavi and Ali Farhadi
Under Review, 2021

abstract / bibtex / pdf / arXiv / code / project page

Enabling robust intelligence in the real-world entails systems that offer continuous inference while learning from varying amounts of data and supervision. The machine learning community has organically broken down this challenging goal into manageable sub-tasks such as supervised, few-shot, and continual learning. In light of substantial progress on each sub-task, we pose the question, “How well does this progress translate to more practical scenarios?” To investigate this question, we construct a new framework, FLUID, which removes certain assumptions made by current experimental setups while integrating these sub-tasks via the following design choices -- consuming sequential data, allowing for flexible training phases, being compute aware, and working in an open-world setting. Evaluating a broad set of methods on FLUID leads to new insights including strong evidence that methods are overfitting to their experimental setup. For example, we find that representative few-shot methods are substantially worse than simple baselines, self-supervised representations from MoCo fail to learn new classes when the downstream task contains a mix of new and old classes, and pretraining largely mitigates the problem of catastrophic forgetting. Finally, we propose two new simple methods that outperform all other evaluated methods, which further calls into question our progress towards robust, real-world systems.

  author    = {Wallingford, Matthew and Kusupati, Aditya 
    and Alizadeh-Vahid, Keivan and Walsman, Aaron and 
    Kembhavi, Aniruddha and Farhadi, Ali},
  title     = {Are We Overfitting 
    to Experimental Setups in Recognition?},
  booktitle = {arXiv preprint arXiv:2007.02519},
  year      = {2020},
Conference Publications

RNNPool: Efficient Non-linear Pooling for
RAM Constrained Inference

Oindrila Saha, Aditya Kusupati, Harsha Vardhan Simhadri,
Manik Varma and Prateek Jain
Neural Information Processing Systems (NeurIPS), 2020

Virtual Spotlight presentation
abstract / bibtex / pdf / reviews / arXiv / code / video / poster / blog 1, 2
Also presented at the WiCV workshop @ CVPR, 2020

Standard Convolutional Neural Networks (CNNs) designed for computer vision tasks tend to have large intermediate activation maps. These require large working memory and are thus unsuitable for deployment on resource-constrained devices typically used for inference on the edge. Aggressively downsampling the images via pooling or strided convolutions can address the problem but leads to a significant decrease in accuracy due to gross aggregation of the feature map by standard pooling operators. In this paper, we introduce RNNPool, a novel pooling operator based on Recurrent Neural Networks (RNNs), that efficiently aggregates features over large patches of an image and rapidly downsamples activation maps. Empirical evaluation indicates that an RNNPool layer can effectively replace multiple blocks in a variety of architectures, such as MobileNets and DenseNet, when applied to standard vision tasks like image classification and face detection. That is, RNNPool can significantly decrease computational complexity and peak memory usage for inference while retaining comparable accuracy. We use RNNPool with the standard S3FD architecture to construct a face detection method that achieves state-of-the-art MAP for tiny ARM Cortex-M4 class microcontrollers with under 256 KB of RAM. Code is released at

  author    = {Saha, Oindrila and Kusupati, Aditya and 
    Simhadri, Harsha Vardhan and Varma, Manik and 
    Jain, Prateek},
  title     = {RNNPool: Efficient Non-linear Pooling 
    for RAM Constrained Inference},
  booktitle = {Advances in 
    Neural Information Processing Systems},
  month     = {December},
  year      = {2020},

Soft Threshold Weight Reparameterization for
Learnable Sparsity

Aditya Kusupati, Vivek Ramanujan*, Raghav Somani*, Mitchell Wortsman*, Prateek Jain, Sham Kakade and Ali Farhadi
International Conference on Machine Learning (ICML), 2020

Virtual Talk
abstract / bibtex / pdf / reviews / arXiv / code / video

Sparsity in Deep Neural Networks (DNNs) is studied extensively with the focus of maximizing prediction accuracy given an overall parameter budget. Existing methods rely on uniform or heuristic non-uniform sparsity budgets which have sub-optimal layer-wise parameter allocation resulting in a) lower prediction accuracy or b) higher inference cost (FLOPs). This work proposes Soft Threshold Reparameterization (STR), a novel use of the soft-threshold operator on DNN weights. STR smoothly induces sparsity while learning pruning thresholds thereby obtaining a non-uniform sparsity budget. Our method achieves state-of-the-art accuracy for unstructured sparsity in CNNs (ResNet50 and MobileNetV1 on ImageNet-1K), and, additionally, learns non-uniform budgets that empirically reduce the FLOPs by up to 50%. Notably, STR boosts the accuracy over existing results by up to 10% in the ultra sparse (99%) regime and can also be used to induce low-rank (structured sparsity) in RNNs. In short, STR is a simple mechanism which learns effective sparsity budgets that contrast with popular heuristics. Code, pretrained models and sparsity budgets are at
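The soft-threshold idea in the abstract above can be sketched in a few lines. This is a minimal illustration and not the released code: it assumes the pruning threshold is obtained by passing a learnable scalar `s` through a sigmoid, and the names `soft_threshold` and `sparsity` are hypothetical.

```python
import math


def soft_threshold(weights, s):
    """Shrink each weight towards zero by a threshold and zero out the rest.

    Computes sign(w) * max(|w| - g(s), 0) elementwise, where g is assumed
    to be a sigmoid so that the threshold stays in (0, 1).
    """
    threshold = 1.0 / (1.0 + math.exp(-s))
    return [math.copysign(max(abs(w) - threshold, 0.0), w) for w in weights]


def sparsity(weights):
    """Fraction of weights that are exactly zero."""
    return sum(1 for w in weights if w == 0.0) / len(weights)


# Toy layer: a threshold of sigmoid(-1.0) (about 0.27) zeroes the two smallest weights.
layer = [0.9, -0.05, 0.3, -0.7, 0.02]
pruned = soft_threshold(layer, s=-1.0)
print(pruned, sparsity(pruned))
```

Because `s` would be trained per layer in such a scheme, each layer can end up with its own threshold, which is one way to read the "non-uniform sparsity budget" mentioned above.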

  author    = {Kusupati, Aditya and Ramanujan, Vivek and
    Somani, Raghav and Wortsman, Mitchell and 
    Jain, Prateek and Kakade, Sham and Farhadi, Ali},
  title     = {Soft Threshold Weight Reparameterization 
    for Learnable Sparsity},
  booktitle = {Proceedings of the International 
    Conference on Machine Learning},
  month     = {July},
  year      = {2020},

Extreme Regression for Dynamic Search Advertising
Yashoteja Prabhu, Aditya Kusupati, Nilesh Gupta and Manik Varma
International Conference on Web Search and Data Mining (WSDM), 2020

Long Oral presentation
abstract / bibtex / pdf / reviews / arXiv / code / poster / XML Repository
Also presented at the Workshop on eXtreme Classification: Theory and Applications @ ICML, 2020

This paper introduces a new learning paradigm called eXtreme Regression (XR) whose objective is to accurately predict the numerical degrees of relevance of an extremely large number of labels to a data point. XR can provide elegant solutions to many large-scale ranking and recommendation applications including Dynamic Search Advertising (DSA). XR can learn more accurate models than the recently popular extreme classifiers which incorrectly assume strictly binary-valued label relevances. Traditional regression metrics which sum the errors over all the labels are unsuitable for XR problems since they could give extremely loose bounds for the label ranking quality. Also, the existing regression algorithms won't efficiently scale to millions of labels. This paper addresses these limitations through: (1) new evaluation metrics for XR which sum only the k largest regression errors; (2) a new algorithm called XReg which decomposes the XR task into a hierarchy of much smaller regression problems, thus leading to highly efficient training and prediction; and (3) a new labelwise prediction algorithm in XReg useful for DSA and other recommendation tasks.
Experiments on benchmark datasets demonstrated that XReg can outperform the state-of-the-art extreme classifiers as well as large-scale regressors and rankers by up to 50% reduction in the new XR error metric, and up to 2% and 2.4% improvements in terms of the propensity-scored precision metric used in extreme classification and the click-through rate metric used in DSA respectively. Deployment of XReg on DSA in Bing resulted in a relative gain of 58% in revenue and 27% in query coverage. XReg's source code can be downloaded from
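The "sum only the k largest regression errors" idea from the abstract above can be illustrated with a short sketch. This is an assumption-laden toy, not the paper's metric definition: it uses absolute errors without any normalization, and the function name `topk_abs_error` is hypothetical.

```python
def topk_abs_error(y_true, y_pred, k):
    """Sum of the k largest absolute errors across labels for one data point."""
    errors = sorted((abs(t - p) for t, p in zip(y_true, y_pred)), reverse=True)
    return sum(errors[:k])


# Four labels with graded relevances; only the two worst errors are counted.
relevance = [1.0, 0.0, 0.5, 0.0]
predicted = [0.5, 0.0, 0.5, 0.25]
print(topk_abs_error(relevance, predicted, k=2))
```

With millions of labels, most of which are irrelevant and predicted near zero, a sum over all labels is dominated by many tiny errors, whereas a top-k sum focuses on the few labels where the prediction is most wrong.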

  author    = {Prabhu, Yashoteja and Kusupati, Aditya and 
    Gupta, Nilesh and Varma, Manik},
  title     = {Extreme Regression for Dynamic 
    Search Advertising},
  booktitle = {Proceedings of the ACM International 
    Conference on Web Search and Data Mining},
  month     = {February},
  year      = {2020},

One Size Does Not Fit All: Multi-Scale, Cascaded RNNs for Radar Classification
Dhrubojyoti Roy*, Sangeeta Srivastava*, Aditya Kusupati, Pranshu Jain, Manik Varma and Anish Arora
International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys), 2019

Oral presentation 🏆 Best Paper Runner-Up Award
abstract / bibtex / pdf / reviews / arXiv / code / poster / dataset / news
Invited Paper in ACM Transactions on Sensor Networks (TOSN), 2021

Edge sensing with micro-power pulse-Doppler radars is an emergent domain in monitoring and surveillance with several smart city applications. Existing solutions for the clutter versus multi-source radar classification task are limited in terms of either accuracy or efficiency, and in some cases, struggle with a trade-off between false alarms and recall of sources. We find that this problem can be resolved by learning the classifier across multiple time-scales. We propose a multi-scale, cascaded recurrent neural network architecture, MSC-RNN, comprised of an efficient multi-instance learning (MIL) Recurrent Neural Network (RNN) for clutter discrimination at a lower tier, and a more complex RNN classifier for source classification at the upper tier. By controlling the invocation of the upper RNN with the help of the lower tier conditionally, MSC-RNN achieves an overall accuracy of 0.972. Our approach holistically improves the accuracy and per-class recalls over machine learning models suitable for radar inferencing. Notably, we outperform cross-domain handcrafted feature engineering with purely time-domain deep feature learning, while also being up to ~3x more efficient than a competitive solution.
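The conditional invocation described in the abstract above, where a cheap lower tier decides whether the costlier upper tier runs at all, can be sketched as follows. Everything here is a stand-in: `make_cascade`, `toy_lower` and `toy_upper` are hypothetical names, and the toy rules on plain lists take the place of the trained RNNs.

```python
def make_cascade(lower_tier, upper_tier, clutter_label="clutter"):
    """Build a two-tier predictor that only calls upper_tier when needed."""
    stats = {"upper_calls": 0}

    def predict(window):
        # The lower tier runs on every window and rejects clutter early.
        if not lower_tier(window):
            return clutter_label
        # Only windows that pass the lower tier pay for the upper tier.
        stats["upper_calls"] += 1
        return upper_tier(window)

    return predict, stats


def toy_lower(window):
    """Stand-in clutter detector: flags windows with a large amplitude."""
    return max(abs(v) for v in window) > 0.5


def toy_upper(window):
    """Stand-in source classifier with two made-up classes."""
    return "human" if sum(window) > 0 else "other"


predict, stats = make_cascade(toy_lower, toy_upper)
windows = [[0.1, -0.2, 0.05], [0.9, 0.4, 0.3], [-0.8, -0.6, 0.1]]
labels = [predict(w) for w in windows]
print(labels, stats["upper_calls"])
```

The counter makes the efficiency argument concrete: the more windows the lower tier rejects as clutter, the less often the expensive classifier is invoked.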

  author    = {Roy, Dhrubojyoti and Srivastava, Sangeeta 
    and Kusupati, Aditya and Jain, Pranshu and 
    Varma, Manik and Arora, Anish},
  title     = {One Size Does Not Fit All: 
    Multi-Scale, Cascaded RNNs for 
    Radar Classification},
  booktitle = {Proceedings of the ACM International 
    Conference on Systems for Energy-Efficient 
    Buildings, Cities, and Transportation},
  month     = {November},
  year      = {2019},

FastGRNN: A Fast, Accurate, Stable and Tiny Kilobyte Sized Gated Recurrent Neural Network
Aditya Kusupati, Manish Singh, Kush Bhatia, Ashish Kumar, Prateek Jain and Manik Varma
Neural Information Processing Systems (NeurIPS), 2018

abstract / bibtex / pdf / reviews / arXiv / code / video / poster / datasets / blog

This paper develops the FastRNN and FastGRNN algorithms to address the twin RNN limitations of inaccurate training and inefficient prediction. Previous approaches have improved accuracy at the expense of prediction costs making them infeasible for resource-constrained and real-time applications. Unitary RNNs have increased accuracy somewhat by restricting the range of the state transition matrix's singular values but have also increased the model size as they require a larger number of hidden units to make up for the loss in expressive power. Gated RNNs have obtained state-of-the-art accuracies by adding extra parameters thereby resulting in even larger models. FastRNN addresses these limitations by adding a residual connection that does not constrain the range of the singular values explicitly and has only two extra scalar parameters. FastGRNN then extends the residual connection to a gate by reusing the RNN matrices to match state-of-the-art gated RNN accuracies but with a 2-4x smaller model. Enforcing FastGRNN's matrices to be low-rank, sparse and quantized resulted in accurate models that could be up to 35x smaller than leading gated and unitary RNNs. This allowed FastGRNN to accurately recognize the "Hey Cortana" wakeword with a 1 KB model and to be deployed on severely resource-constrained IoT microcontrollers too tiny to store other RNN models. FastGRNN's code is available at
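The residual connection with two extra scalars mentioned in the abstract above can be sketched as a single recurrent step. This is a minimal illustration under stated assumptions, not the released implementation: it uses tanh as the nonlinearity and plain Python lists for the matrices, and `matvec` and `fastrnn_step` are hypothetical names. Here `alpha` and `beta` play the role of the two scalar parameters.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def fastrnn_step(x, h_prev, W, U, b, alpha, beta):
    """One step: h = alpha * tanh(W x + U h_prev + b) + beta * h_prev."""
    pre = [wx + uh + bi for wx, uh, bi in zip(matvec(W, x), matvec(U, h_prev), b)]
    candidate = [math.tanh(p) for p in pre]
    # Weighted residual connection between the candidate and the previous state.
    return [alpha * c + beta * h for c, h in zip(candidate, h_prev)]


# Toy dimensions: 2 input features, 2 hidden units.
W = [[0.5, -0.3], [0.1, 0.8]]
U = [[0.2, 0.0], [0.0, 0.2]]
b = [0.0, 0.1]
h = [0.0, 0.0]
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    h = fastrnn_step(x, h, W, U, b, alpha=0.1, beta=0.9)
print(h)
```

Setting `alpha` small and `beta` close to one keeps each state near the previous one, which is the intuition behind using a residual connection to stabilize training.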

  author    = {Kusupati, Aditya and Singh, Manish and 
    Bhatia, Kush and Kumar, Ashish and 
    Jain, Prateek and Varma, Manik},
  title     = {{FastGRNN}: A Fast, Accurate, 
    Stable and Tiny Kilobyte Sized 
    Gated Recurrent Neural Network},
  booktitle = {Advances in 
    Neural Information Processing Systems},
  month     = {December},
  year      = {2018},

Technical Reports

Adapting Unstructured Sparsity Techniques for Structured Sparsity
Aditya Kusupati
Technical Report, 2020

abstract / bibtex / pdf / code

Unstructured and structured sparsities provide unique advantages in resource-efficient sparse neural networks. Unstructured sparsity can assist in obtaining highly sparse and accurate models, while structured sparsity focuses mainly on enabling fast parallelizable inference on commodity hardware (e.g. GPUs). In the recent past, these distinctive advantages led the two sub-fields to diverge and become disconnected. In this report, we argue that most recent advances in unstructured sparsity can be adapted for inducing structured sparsity in deep neural networks. We also note the similarities between these two sub-fields and document how the solutions from unstructured sparsity can be leveraged in solving the issues of structured sparsity. We also showcase the ease of adaptation by proposing STR-BN, which is an application of the recently proposed STR method on batch normalization to induce structured sparsity via filter/neuron pruning. Code for STR-BN can be found at

  author    = {Kusupati, Aditya},
  title     = {Adapting Unstructured Sparsity 
    Techniques for Structured Sparsity},
  booktitle = {Technical Report},
  month     = {August},
  year      = {2020},


Lightweight, Deep RNNs for Radar Classification
Dhrubojyoti Roy*, Sangeeta Srivastava*, Pranshu Jain, Aditya Kusupati, Manik Varma and Anish Arora
International Conference on Systems for Energy-Efficient Buildings, Cities, and Transportation (BuildSys), 2019

abstract / bibtex / pdf

We demonstrate Multi-Scale, Cascaded RNN (MSC-RNN), an energy-efficient recurrent neural network for real-time micro-power radar classification. Its two-tier architecture is jointly trained to reject clutter and discriminate displacing sources at different time-scales, with a lighter lower tier running continuously and a heavier upper tier invoked infrequently on an on-demand basis. It offers for single microcontroller devices a better trade-off in accuracy and efficiency, as well as in clutter suppression and detectability, over competitive shallow and deep alternatives.

  author    = {Roy, Dhrubojyoti and Srivastava, Sangeeta 
    and Jain, Pranshu and Kusupati, Aditya and 
    Varma, Manik and Arora, Anish},
  title     = {Lightweight, Deep RNNs 
    for Radar Classification},
  booktitle = {Proceedings of the ACM International 
    Conference on Systems for Energy-Efficient 
    Buildings, Cities, and Transportation},
  month     = {November},
  year      = {2019},


Efficient Spatial Representation for Entity-Typing
Anand Dhoot*, Aditya Kusupati* and Soumen Chakrabarti
Undergraduate Thesis, CSE IIT Bombay, 2016-17

abstract / bibtex / pdf

The project aims at creating efficient spatial embeddings for entities and types, which would be useful for various downstream tasks such as Knowledge Base Completion, Fine-Type Tagging and Question Answering.

  author = {Dhoot, Anand and Kusupati, Aditya 
    and Chakrabarti, Soumen},
  title = {Efficient Spatial Representation 
    for Entity-Typing},
  booktitle = {Undergraduate Thesis, CSE IIT Bombay},
  year = {2016-17},


EdgeML: Machine Learning for resource-constrained edge devices
Work of many amazing collaborators. I was one of the initial and primary contributors.
Github, Microsoft Research India, 2017-present.

abstract / bibtex

Open source repository for all the research outputs on resource efficient Machine Learning from Microsoft Research India. It contains scalable and multi-framework compatible implementations of Bonsai, ProtoNN, FastCells, EMI-RNN, ShaRNN, RNNPool, DROCC, a tool named SeeDot for fixed-point compilation of ML models along with applications such as on-device Keyword spotting and Gesturepod.
EdgeML is under MIT license and is open to contributions and suggestions. Please cite the software if you happen to use EdgeML in your research or otherwise (use the latest bibtex from the repository).

    author = {{Dennis, Don Kurian and Gaurkar, Yash and 
      Gopinath, Sridhar and Goyal, Sachin 
      and Gupta, Chirag and Jain, Moksh 
      and Kumar, Ashish and Kusupati, Aditya 
      and Lovett, Chris and Patil, Shishir Girish 
      and Saha, Oindrila and Simhadri, Harsha Vardhan}},
    title = {{EdgeML: Machine Learning 
      for resource-constrained edge devices}},
    url = {},
    version = {0.3},

CS226/254: Digital Logic Design + Lab - Spring '17, IIT Bombay

CS251: Software Systems Lab - Fall '16, IIT Bombay

CS226/254: Digital Logic Design + Lab - Spring '16, IIT Bombay

CS101: Computer Programming and Utilisation - Fall '15, IIT Bombay

CS101: Computer Programming and Utilisation - Spring '15, IIT Bombay

  • Soft Threshold Weight Reparameterization for Learnable Sparsity
    • International Conference on Machine Learning (ICML) (July '20)
    • NVIDIA Research (July '20)
    • Deep Learning: Classics and Trends (June '20)
  • The Edge of Machine Learning
    • University of Washington Sensor Systems Seminar (October '19)
    • University of Washington CSE Colloquium (October '19)
    • VGG @ Oxford University, UK (April '19)
    • Microsoft Research Redmond (March '19)
    • Microsoft Research India (August '18)
  • The Extremes of Machine Learning
    • Microsoft Bing Bellevue (March '19)
