About me
I am a fifth-year PhD student in computer science at the University of Washington, where I am advised by Hannaneh Hajishirzi and Ali Farhadi and collaborate closely with Ranjay Krishna. My research spans pretraining, post-training, and benchmarking of multi-modal large language models, with a focus on video language models. Prior to UW, I received my B.Sc. in computer engineering from Sharif University of Technology. I publish under my full name, Mohammadreza, and go by Reza among friends. Outside of work, I'm into nature, hiking, biking, going to the gym, running, and reading books. I keep a small page where I write my thoughts about the books I read.
News
- 02/2026 Molmo2 and VideoNet accepted to CVPR 2026.
- 02/2025 Molmo accepted to CVPR 2025.
- 09/2024 Action Atlas accepted to NeurIPS 2024 D&B.
- 09/2024 Check out Molmo, our new vision language model!
- 05/2024 CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement accepted to TMLR.
- 10/2023 Our paper CLIP meets Model Zoo Experts was accepted to the UniReps Workshop at NeurIPS'23.
- 10/2023 SHARCS accepted to EMNLP'23 Findings.
- 09/2023 Our research proposal won the Microsoft Accelerate Foundation Models Research grant.
- 06/2023 Our team Sherlock AI won first place ($20k+ in cloud and API credits) at the AI Tinkerers Gen AI hackathon and was featured on Cohere's blog.
- 05/2023 Our LLM-powered Gmail assistant won first place in the fixie.ai hackathon.
- 01/2023 Started my internship with the AIML team at Apple.
Publications
ActionAtlas: A VideoQA Benchmark for Domain-specialized Action Recognition
Mohammadreza Salehi, Jae Sung Park, Tanush Yadav, Aditya Kusupati, Ranjay Krishna, Yejin Choi, Hannaneh Hajishirzi, Ali Farhadi
NeurIPS 2024 D&B
TL;DR:
A benchmark for evaluating foundation models on complex, domain-specialized actions, which demands a high enough frame sampling rate to capture fast movements along with tracking and understanding sequences of fine movements.
Molmo and PixMo: Open Weights and Open Data for State-of-the-Art Multimodal Models
M. Deitke, C. Clark, S. Lee, R. Tripathi, Y. Yang, J.S. Park, M. Salehi et al.
CVPR 2025
TL;DR:
We present Molmo, a new family of VLMs that are state-of-the-art in their class of openness. The best-in-class 72B model in the Molmo family not only outperforms other open-weight, open-data models but also compares favorably against proprietary systems like GPT-4o, Claude 3.5, and Gemini 1.5 on both academic benchmarks and human evaluation.
CLIP meets Model Zoo Experts: Pseudo-Supervision for Visual Enhancement
Mohammadreza Salehi, Mehrdad Farajtabar, Maxwell Horton, Fartash Faghri, Hadi Pouransari, Raviteja Vemulapalli, Oncel Tuzel, Ali Farhadi, Mohammad Rastegari, Sachin Mehta
TMLR'24
TL;DR:
We showed how one can merge any task-specific expert model from open-source model zoos into a foundation model (FM) such as CLIP. This enhances the FM's visual features for dense prediction and localization tasks without collecting any supervised data.
SHARCS: Efficient Transformers through Routing with Dynamic Width Sub-networks
Mohammadreza Salehi, Sachin Mehta, Aditya Kusupati, Ali Farhadi, Hannaneh Hajishirzi
EMNLP'23 Findings
TL;DR:
We introduced SHARCS, a new sample-adaptive inference method. It routes samples to sub-networks of varying widths within any transformer network based on the hardness of the input sample.
Attentional Mixtures of Soft Prompt Tuning for Parameter-efficient Multi-task Knowledge Sharing
Akari Asai, Mohammadreza Salehi, Matthew E. Peters, Hannaneh Hajishirzi
EMNLP'22
TL;DR:
We introduced a new parameter-efficient fine-tuning method based on prompt tuning. Prompts are first learned for a set of source tasks; then, for each sample in a new target task, an attentional mixture of the source prompts serves as the target prompt.
MERLOT Reserve: Multimodal Neural Script Knowledge through Vision and Language and Sound
Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, Yejin Choi
CVPR'22
TL;DR:
We introduce MERLOT Reserve, which learns from 20 million YouTube videos through all of their modalities (audio, vision, and text). Learning from audio helps broadly, even on single-image tasks like VCR. Our model learns state-of-the-art representations that also transfer well to video-based tasks in a zero-shot setting.
Paraphrase Generation by Learning How to Edit from Samples
Amirhossein Kazemnejad, Mohammadreza Salehi, Mahdieh Soleymani Baghshah
ACL'20
TL;DR:
Paraphrase generation by retrieving similar paraphrase pairs from a pre-existing corpus and editing them with a multi-head attention mechanism.