I am Zihao Ye, a third-year Ph.D. student at the University of Washington's Paul G. Allen School of Computer Science and Engineering, advised by Luis Ceze in the SAMPL research group. I also work closely with Tianqi Chen on Machine Learning Compilation.
We organize talks at SAMPL on topics including Systems, Architecture, Compilers, Verification, and Machine Learning.
My current research focuses on Machine Learning Compilation, Sparse Computation, and Systems for Foundation Model Serving.
Feel free to drop me an email if our interests align; I'm open to collaborations.
- Jan 2024 Check out our latest blog post Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding, which explains how to accelerate shared-prefix LLM serving with cascade inference. The related APIs have been integrated into FlashInfer.
- Jan 2024 Check out our blog post Accelerating Self-Attentions for LLM Serving with FlashInfer for an introduction to the FlashInfer project, where we also explain the characteristics of attention operators in LLM serving.
- Dec 2023 Excited to be a recipient of the NVIDIA Graduate Fellowship for 2024-2025. Thanks to NVIDIA and all my collaborators over the years; let's keep doing cool research!
- Oct 2023 Check out our latest LLM serving system Punica, which serves multiple LoRA models with the same latency as a single model! Our paper and the blog post written by Lequn Chen explain the motivation and design of Punica.
- Oct 2023 We released FlashInfer: Kernel Library for LLM Serving, which accelerates LLM operators in serving. We also maintain a blog where we write posts explaining the techniques that accelerate LLMs; we hope it will be useful for people working in this field.
- Jun 2023 I'll give a talk on Sparsity in LLMs at CTSTA @ PLDI 2023. See you all in Orlando!
- May 2023 Please take a look at the MLC-LLM project, which enables the deployment of LLMs on a wide range of hardware platforms. I am honored to be a part of this project and to collaborate with an extraordinary team!
- Mar 2023 SparseTIR has been awarded Distinguished Artifact at ASPLOS 2023!
- Feb 2023 I'm going to be a TA for CSE 599M (ML for ML Systems), taught by Luis Ceze, in Spring 2023. I'm really excited to be a part of this new course.
- Jan 2023 We are proud to announce that SparseTIR will appear at ASPLOS 2023. We'll be heading to Vancouver and can't wait to see everyone there!
- FlashInfer: Kernel Library for LLM Serving
- Punica: Serving multiple LoRA fine-tuned LLMs as one
- MLC-LLM: Universal Deployment of Large Language Models
I'm excited to be part of the MLC community and to collaborate with a strong team on the following projects in TVM Unity:
- Relax: Composable Abstractions for End-to-End Dynamic Machine Learning
- TensorIR: Tensor-Level Abstractions for Deep Learning Operators
Composable Abstractions for Sparse Compilation in Deep Learning
- Mar 2023 @ ASPLOS 2023, Vancouver, Canada
- Mar 2023 @ TVMCon
- Nov 2022 @ Amazon AI
- Oct 2022 @ CRISP Liaison Meeting
- Aug 2022 @ Tsinghua NISF-EFC Group
- Aug 2022 @ Cornell Zhang Research Group
- Jul 2022 @ Google MLIR Reading Group
- Dec 2021 @ TVMCon
SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning.
The 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023. Distinguished Artifact Award.