About Me
I am Zihao Ye, a third-year Ph.D. student at the University of Washington's Paul G. Allen School of Computer Science and Engineering, advised by Luis Ceze in the SAMPL research group. I also work closely with Tianqi Chen on Machine Learning Compilation.
Prior to joining UW, I spent two years at AWS where I worked with Minjie Wang and Zheng Zhang. I obtained my bachelor's degree from ACM Honors Class at Shanghai Jiao Tong University.
We organize talks at SAMPL on topics including Systems, Architecture, Compilers, Verification, and Machine Learning.
Feeling overwhelmed by all the pressure and competition in LLM Systems research? I've got some suggestions that might help.
Research
My current research focuses on Machine Learning Compilation, Sparse Computation, and Systems for Foundation Model Serving (the field is so toxic, I quit).
Feel free to drop me an email if our interests align; I'm open to collaborations.
News
- Apr 2024 I have recovered from a long period of burnout and depression. Now I am ready to get back to work. I am grateful for the support of my friends and family during this time.
- Feb 2024 Punica and Atom have been accepted to MLSys 2024, congratulations to the team!
- Jan 2024 Check our latest blog post Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding, which explains how to accelerate shared-prefix LLM serving with cascade inference. The related APIs have been integrated into FlashInfer.
- Jan 2024 Check our blog post Accelerating Self-Attentions for LLM Serving with FlashInfer for an introduction to the FlashInfer project, where we also explain the characteristics of attention operators in LLM serving.
- Dec 2023 Excited to be a recipient of the NVIDIA Graduate Fellowship 2024-2025. Thanks to NVIDIA and all my collaborators over the years; let's keep doing cool research!
- Oct 2023 Check out our latest LLM serving system Punica, which serves multiple LoRA models with the same latency as a single model! Our paper and the blog post written by Lequn Chen explain the motivation and design of Punica.
- Oct 2023 We released FlashInfer: Kernel Library for LLM Serving, which accelerates LLM operators in serving. We also maintain a blog where we write posts explaining techniques for accelerating LLMs; we hope it will be useful to people working in this field.
- Jun 2023 I'll give a talk on Sparsity in LLMs at CTSTA @ PLDI 2023, see you all in Orlando!
- May 2023 Please take a look at the MLC-LLM project, which enables the deployment of LLMs on a wide range of hardware platforms. I am honored to be a part of this project and to collaborate with an extraordinary team!
- Mar 2023 SparseTIR has been awarded Distinguished Artifact at ASPLOS 2023!
- Feb 2023 I'm going to be a TA for CSE 599M in Spring 2023, which is ML for ML Systems taught by Luis Ceze. I'm really excited to be a part of this new course.
- Jan 2023 We are proud to announce that SparseTIR will be featured at ASPLOS 2023. We'll be heading to Vancouver and can't wait to see everyone there!
Active Projects
- FlashInfer: Kernel Library for LLM Serving
- Punica: Serving multiple LoRA fine-tuned LLMs as one
- MLC-LLM: Universal Deployment of Large Language Models
I'm excited to be part of the MLC community and to collaborate with a strong team on the following projects in TVM Unity:
- Relax: Composable Abstractions for End-to-End Dynamic Machine Learning
- TensorIR: Tensor-Level Abstractions for Deep Learning Operators
Earlier Projects
- SparseTIR: Compiler Abstraction for Sparse Operators in Deep Learning, built upon Apache TVM.
- ASPLOS Artifact: scripts to reproduce ASPLOS paper results.
- DGL: Efficient and Scalable Deep Learning on Graphs
Selected Publications
ASPLOS 2023 | SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning. The 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023. Distinguished Artifact Award.
Activity and Service
- External Review Committee: MLSys 2023
- Artifact Evaluation Committee: MLSys 2022, CGO 2023, OSDI/ATC 2023
- Reviewer: IEEE CAL