About me

I am a fifth-year Ph.D. candidate in the Department of Electrical and Computer Engineering at Cornell University, working in the Computer Systems Lab and advised by Prof. Zhiru Zhang.

My research focuses on high-performance computing, with an emphasis on performance optimization across heterogeneous devices such as GPUs and NPUs for both AI and scientific applications. I am also actively exploring agentic workflows for automating performance engineering and system optimization.

Education

Cornell University  ·  Sep. 2021 – Present
Ph.D. candidate in Electrical and Computer Engineering

Cornell University  ·  Sep. 2021 – Dec. 2025
M.S. in Electrical and Computer Engineering

Tsinghua University  ·  Sep. 2016 – Jun. 2020
B.E. in Electronic Engineering

Academic Research

Cornell University  ·  Sep. 2021 – Present
Ph.D. candidate, Zhang Research Group, Computer Systems Lab
Advisor: Prof. Zhiru Zhang

Agentic kernel generation and optimization on heterogeneous devices — exploring how LLM agents can autonomously generate and tune high-performance kernels across modern accelerators (GPUs, NPUs).

Rapid GPU-Based Pangenome Graph Layout — proposed the first GPU-based solution for pangenome graph layout, achieving an average 57.3× speedup over the state-of-the-art CPU implementation and enabling minute-scale layout of the entire human chromosome dataset; integrated into the pangenome analysis pipeline ODGI.

Analysis and Optimization of GNN-Based Recommender Systems on Persistent Memory — characterized and optimized GNN-based recommender workloads on persistent-memory hardware.

UCLA  ·  Jun. 2019 – Sep. 2019
Research Intern, VAST Lab
Advisor: Prof. Jason Cong
HeteroHalide: An End-to-End Compilation System from Image Processing DSL to Efficient FPGA Acceleration

  • Proposed HeteroHalide, an end-to-end compilation system from Halide to FPGA accelerators, with a Halide-to-HeteroCL code generator and scheduling extensions that emit lower-level primitives for the spatial-architecture backend.
  • Demonstrated that the generated FPGA accelerators outperformed both multi-core CPU baselines and the state-of-the-art Halide-to-FPGA compiler, while significantly reducing migration effort from Halide.

Tsinghua University  ·  Nov. 2018 – Jun. 2019
Research Assistant, NICS-EFC Lab (Energy Efficient Computing Group)
Advisor: Prof. Yu Wang
Hardware-Friendly Neural Network Training Algorithm Optimization

  • Quantified the impact of low-bit-width quantization and network pruning on neural network models trained on GPUs.
  • Applied knowledge distillation for model compression and studied how varying network sizes affect accuracy.

Industry Experience

AMD Research and Advanced Development (RAD)  ·  Jan. 2026 – May 2026
Ph.D. Research Associate
Mentors: Erwei Wang, Samuel Bayliss
Agent Skill System for End-to-End LLM Deployment on Spatial NPUs

  • Mapped and optimized an LLM end-to-end on the AMD XDNA™ NPU, outperforming the existing open-source baseline.
  • Built an agent skill system for automating end-to-end LLM deployment on the AMD XDNA™ NPU.

ByteDance  ·  Aug. 2024 – Dec. 2024
Research Scientist Intern, Machine Learning System team, Seed Foundation
Mentors: Wenlei Bao, Li-Wen Chang
Benchmarking Optimized LLM Kernels

  • Benchmarked LLM kernels implemented with cuBLAS, Triton, and CUTLASS across different problem sizes, data types, and GPU architectures.
  • Studied low-level optimization techniques, particularly warp specialization with TMA on Hopper GPUs.

NVIDIA  ·  May 2024 – Aug. 2024
Deep Learning Training Performance Intern, End-to-End Training Performance team (working on MLPerf Training)
Mentors: Rachit Garg, Burc Eryilmaz
LLM Training Toolbox: Memory Footprint Analyzer & Config-Shmooer

  • Built a memory footprint analyzer for peak-memory debugging and leak detection in LLM training, integrated into NeMo and the MLPerf training pipeline.
  • Built an autotuner that searches the large training-configuration space (tensor-, pipeline-, and context-parallel sizes, TP-overlap settings, etc.) for the best setup given a model and hardware.

Alibaba DAMO Academy  ·  Jul. 2020 – Jan. 2021
Research Intern, Computing Technology Lab
Mentor: Yuanwei Fang
Micro-architecture Aware Neural Program Embedding

  • Proposed Neural Program Sampling (NPS), a novel framework that provides high-resolution execution embeddings for accurate program sampling.
  • Built the NPS-gem5 evaluation testbed by enhancing gem5 to report detailed-simulation statistics at specific instruction intervals, enabling fast and flexible simulation.

Publications

HeuriGym: An Agentic Benchmark for LLM-Crafted Heuristics in Combinatorial Optimization
Hongzheng Chen*, Yingheng Wang*, Yaohui Cai*, Hins Hu*, Jiajie Li*, Shirley Huang, Chenhui Deng, Rongjian Liang, Shufeng Kong, Haoxing Ren, Samitha Samaranayake, Carla P. Gomes, Zhiru Zhang. (*core contributors)
[ICLR’26]. The International Conference on Learning Representations, 2026.

Dato: A Task-Based Programming Model for Dataflow Accelerators
Shihan Fang, Hongzheng Chen, Niansong Zhang, Jiajie Li, Han Meng, Adrian Liu, Zhiru Zhang.
arXiv:2509.06794, 2025.

Rapid GPU-Based Pangenome Graph Layout
Jiajie Li, Jan-Niklas Schmelzle, Yixiao Du, Simon Heumos, Andrea Guarracino, Giulia Guidi, Pjotr Prins, Erik Garrison, Zhiru Zhang.
[SC’24]. The International Conference for High Performance Computing, Networking, Storage, and Analysis, 2024.

Pangenome Graph Layout by Path-Guided Stochastic Gradient Descent
Simon Heumos, Andrea Guarracino, Jan-Niklas M Schmelzle, Jiajie Li, Zhiru Zhang, Jörg Hagmann, Sven Nahnsen, Pjotr Prins, Erik Garrison.
[Bioinformatics]. Volume 40, Issue 7, July 2024.

NPS: A Framework for Accurate Program Sampling Using Graph Neural Network
Yuanwei Fang, Zihao Liu, Yanheng Lu, Jiawei Liu, Jiajie Li, Yi Jin, Jian Chen, Yenkuang Chen, Hongzhong Zheng, Yuan Xie.
arXiv:2304.08880, 2023.

Analysis and Optimization of GNN-Based Recommender Systems on Persistent Memory
Yuwei Hu, Jiajie Li, Zhongming Yu, Zhiru Zhang.
arXiv:2207.11918, 2022.

HeteroHalide: From Image Processing DSL to Efficient FPGA Acceleration
Jiajie Li, Yuze Chi, Jason Cong.
[FPGA’20]. 28th ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2020.

Teaching

Cornell University  ·  Aug. 2023 – Dec. 2023
Teaching Assistant, ECE 2300 Digital Logic and Computer Organization

Service

Student Volunteer at FCCM’22, the IEEE International Symposium on Field-Programmable Custom Computing Machines


Last updated: May 2026