I am Zihao Ye, a third-year Ph.D. student at the University of Washington's Paul G. Allen School of Computer Science and Engineering, advised by Luis Ceze in the SAMPL research group. I also work closely with Tianqi Chen on Machine Learning Compilation.
We organize talks at SAMPL on topics including Systems, Architecture, Compilers, Verification, and Machine Learning.
My current research focuses on Machine Learning Compilation, Sparse Computation, and Systems for Foundation Model Serving.
Feel free to drop me an email if our interests align; I'm open to collaborations.
- Jan 2024 Check out our latest blog post Cascade Inference: Memory Bandwidth Efficient Shared Prefix Batch Decoding, which explains how to accelerate shared-prefix LLM serving with cascade inference. The related APIs have been integrated into FlashInfer.
- Jan 2024 Check out our blog post Accelerating Self-Attentions for LLM Serving with FlashInfer for an introduction to the FlashInfer project, where we also explain the characteristics of attention operators in LLM serving.
- Dec 2023 Excited to be a recipient of the NVIDIA Graduate Fellowship for 2024-2025. Thanks to NVIDIA and all my collaborators over the years; let's keep doing cool research!
- Oct 2023 Check out our latest LLM serving system Punica, which serves multiple LoRA models with the same latency as a single model! Our paper and the blog post written by Lequn Chen explain the motivation and design of Punica.
- Oct 2023 We released FlashInfer: Kernel Library for LLM Serving, which accelerates LLM operators in serving. We also maintain a blog where we write posts explaining the techniques that accelerate LLMs; we hope it will be useful for people working in this field.
- Jun 2023 I'll give a talk on Sparsity in LLMs at CTSTA @ PLDI 2023. See you all in Orlando!
- May 2023 Please take a look at the MLC-LLM project, which enables the deployment of LLMs on a wide range of hardware platforms. I am honored to be a part of this project and to collaborate with an extraordinary team!
- Mar 2023 SparseTIR has been awarded Distinguished Artifact at ASPLOS 2023!
- Feb 2023 I'm going to be a TA for CSE 599M (ML for ML Systems), taught by Luis Ceze, in Spring 2023. I'm really excited to be a part of this new course.
- Jan 2023 We are proud to announce that SparseTIR will appear at ASPLOS 2023. We'll be heading to Vancouver and can't wait to see everyone there!
- FlashInfer: Kernel Library for LLM Serving
- Punica: Serving multiple LoRA fine-tuned LLMs as one
- MLC-LLM: Universal Deployment of Large Language Models
I'm excited to be part of the MLC community and to collaborate with a strong team on the following projects in TVM Unity:
- Relax: Composable Abstractions for End-to-End Dynamic Machine Learning
- TensorIR: Tensor-Level Abstractions for Deep Learning Operators
Composable Abstractions for Sparse Compilation in Deep Learning
- Mar 2023 @ ASPLOS 2023, Vancouver, Canada
- Mar 2023 @ TVMCon
- Nov 2022 @ Amazon AI
- Oct 2022 @ CRISP Liaison Meeting
- Aug 2022 @ Tsinghua NISF-EFC Group
- Aug 2022 @ Cornell Zhang Research Group
- Jul 2022 @ Google MLIR Reading Group
- Dec 2021 @ TVMCon
SparseTIR: Composable Abstractions for Sparse Compilation in Deep Learning.
The 28th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, 2023. Distinguished Artifact Award.