Zihao Ye (叶子豪)
Ph.D. student @ SAMPL, UW CSE
Bill & Melinda Gates Center, Room 330
Email : zhye [at] cs [dot] washington [dot] edu
Strava : My Account
Github : yzh119
About Me
I am Zihao Ye, a fourth-year Ph.D. student at the University of Washington's Paul G. Allen School of Computer Science and Engineering, advised by Luis Ceze in the SAMPL research group. I also work closely with Tianqi Chen on Machine Learning Compilers.
Prior to joining UW, I spent two years at AWS, where I worked with Minjie Wang and Zheng Zhang. I obtained my bachelor's degree from the ACM Honors Class at Shanghai Jiao Tong University. We organize talks at SAMPL on topics including Systems, Architecture, Compilers, Verification, and Machine Learning.
Research
My current research interests include Machine Learning Compilers and Sparse Computation.
I'm focusing on developing FlashInfer and related research projects. Feel free to drop me an email if you are interested; I'm open to collaborations.
News
- Jul 2024 Finished my first STP Ride.
- Jun 2024 Starting my internship at NVIDIA on Deep Learning Compilers.
- Apr 2024 I have recovered from a long period of burnout and depression. Now I am ready to get back to work. I am grateful for the support from my love, my friends and family since October 2023.
The following updates are outdated:
- Jan 2024 Check out our latest blog post Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding, which explains how to accelerate shared-prefix LLM serving with cascade inference; the related APIs have been integrated into FlashInfer.
- Jan 2024 Check out our blog post Accelerating Self-Attentions for LLM Serving with FlashInfer for an introduction to the FlashInfer project, where we also explain the characteristics of attention operators in LLM serving.
- Dec 2023 Excited to be a recipient of the NVIDIA Graduate Fellowship for 2024-2025. Thanks to NVIDIA and all my collaborators over the years; let's keep doing cool research!
- Oct 2023 Check out our latest LLM serving system Punica, which serves multiple LoRA models with the same latency as a single model! Our paper and the blog post written by Lequn Chen explain the motivation and design of Punica.
- Oct 2023 We released FlashInfer: Kernel Library for LLM Serving, which accelerates LLM operators in serving. We maintain a blog where we write posts explaining the techniques that accelerate LLMs; we hope this will be useful for people working in this field.
- Jun 2023 I'll give a talk on Sparsity in LLMs at CTSTA @ PLDI 2023, see you all in Orlando!
- May 2023 Please take a look at the MLC-LLM project, which enables the deployment of LLMs on a wide range of hardware platforms. I am honored to be part of this project and to collaborate with an extraordinary team!
- Mar 2023 SparseTIR received the Distinguished Artifact Award at ASPLOS 2023!
- Feb 2023 I'm going to be a TA for CSE 599M (ML for ML Systems), taught by Luis Ceze, in Spring 2023. I'm really excited to be a part of this new course.
- Jan 2023 We are proud to announce that SparseTIR will be featured at ASPLOS 2023. We'll be heading to Vancouver and can't wait to see everyone there!
Misc
Heaven Sent is my favorite Doctor Who episode, and it helped me get through the darkest moments of my PhD.
Active Projects
- FlashInfer: Kernel Library for LLM Serving
- MLC-LLM: Universal Deployment of Large Language Models
I'm excited to be part of the MLC community and to collaborate with a strong team on the following projects in TVM Unity:
- Relax: Composable Abstractions for End-to-End Dynamic Machine Learning
- TensorIR: Tensor-Level Abstractions for Deep Learning Operators
Earlier Projects
- Punica: Serving multiple LoRA-finetuned LLMs as one
- SparseTIR: Compiler Abstraction for Sparse Operators in Deep Learning, built upon Apache TVM.
- ASPLOS Artifact: scripts to reproduce ASPLOS paper results.
- DGL: Efficient and Scalable Deep Learning on Graphs
Selected Publications
ASPLOS 2023
SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning.
The 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023. Distinguished Artifact Award.