About Me

I am a graduate student at Cornell Tech and a research assistant at the Computer Systems Laboratory, where I design efficient algorithms for large language models and build scalable ML systems. I received my B.S. in Computer Science (CS) and Electrical and Computer Engineering (ECE) from Cornell University, graduating summa cum laude.

I'm graduating in December 2025 and am actively seeking Software Engineer, Machine Learning Engineer, and Research Engineer opportunities. I'm also interested in exploring areas beyond my past experience to broaden my expertise and take on diverse technical challenges. My updated resume can be found here.

My recent work explores test-time optimization for diffusion language models, focusing on caching strategies and diffusion scheduling to accelerate inference. In the past, I have worked on recommendation systems (retrieval and ranking) and DNN model quantization.

I am broadly interested in ML systems optimization, distributed computing, and efficient inference for large-scale AI, with an emphasis on connecting research to practical and scalable solutions. I have been fortunate to collaborate with Prof. Zhiru Zhang, Prof. Udit Gupta, Prof. Mohamed Abdelfattah, and Prof. Jae-sun Seo throughout my academic journey.

Research

FlashDLM: Test-Time Optimization for Diffusion LLM Inference

Novel caching strategies and guided diffusion for 34× inference speedup
Thanks to my collaborators Jian Meng and Yash Akhauri for their valuable contributions to this work.

Diffusion language models enable parallel token generation and bidirectional context, but they suffer from slow inference due to iterative denoising, making them impractical for long-context reasoning compared to autoregressive models.

This work introduces FreeCache and Guided Diffusion, two training-free techniques for accelerating diffusion inference. FreeCache reuses stable key-value projections across steps, and Guided Diffusion employs a lightweight autoregressive model to guide token unmasking, reducing the number of iterations while preserving coherence.
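To make the interplay of the two techniques concrete, here is a minimal sketch of the decoding loop. The interfaces are hypothetical stand-ins, not the FlashDLM code: `model(tokens, past_kv=...)` is assumed to return logits plus KV projections, and `guide.agrees(...)` stands in for the lightweight autoregressive verifier.

```python
import torch

def decode_with_freecache(model, tokens, mask_id, max_steps, guide=None):
    """Minimal sketch of a FreeCache + Guided Diffusion decoding loop.

    Hypothetical interfaces: `model(tokens, past_kv=...)` returns logits and
    KV projections; `guide.agrees(tokens, proposal)` is a small autoregressive
    verifier. Neither matches the paper's implementation exactly.
    """
    kv_cache = None
    for _ in range(max_steps):
        # FreeCache: pass cached KV projections of already-committed
        # positions back in, so only masked positions are recomputed.
        logits, kv_cache = model(tokens, past_kv=kv_cache)
        proposal = logits.argmax(dim=-1)
        confidence = logits.softmax(dim=-1).amax(dim=-1)

        if guide is not None:
            # Guided Diffusion: commit tokens the lightweight AR model
            # agrees with, unmasking many positions per step.
            commit = guide.agrees(tokens, proposal)
        else:
            commit = confidence > 0.9  # plain confidence threshold

        still_masked = tokens == mask_id
        tokens = torch.where(still_masked & commit, proposal, tokens)
        if not (tokens == mask_id).any():
            break  # every position committed; stop denoising early
    return tokens
```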

Together, these methods achieve up to a 34× end-to-end speedup with negligible accuracy loss, making diffusion models as efficient as autoregressive baselines and enabling their deployment in real-world applications.

FreeCache

FreeCache Architecture

Guided Diffusion

Guided Diffusion Architecture

ReCoOpt: Co-Design Framework for Efficient Recommendation Systems

Hardware-software co-design for end-to-end recommendation optimization
Thanks to my collaborator Mark Zhao for his valuable contributions to this work.

Modern recommendation systems execute multi-stage pipelines at massive scale, but most optimization efforts target only the deep neural network ranking stage, leaving major bottlenecks in feature fetching, approximate nearest neighbor search, and orchestration.

To address this gap, this work introduces ReCoOpt, a modular framework for systematic hardware-software co-design of end-to-end recommendation inference. The framework profiles and tunes retrieval, feature fetching, and ranking across heterogeneous platforms to explore balanced pipeline configurations.
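As a rough illustration of the profiling side, the sketch below times the three stages per query. The stage callables (`retrieve`, `fetch_features`, `rank`) are hypothetical placeholders; the actual framework additionally sweeps hardware and software knobs for each stage rather than just timing them.

```python
import time
from dataclasses import dataclass, field

@dataclass
class StageProfile:
    """Per-stage latency samples for one pipeline configuration."""
    name: str
    samples_ms: list = field(default_factory=list)

def profile_pipeline(retrieve, fetch_features, rank, queries):
    """Sketch of end-to-end stage profiling in the spirit of ReCoOpt."""
    stages = [StageProfile("retrieval"), StageProfile("feature_fetch"),
              StageProfile("ranking")]
    for q in queries:
        t0 = time.perf_counter()
        candidates = retrieve(q)            # approximate nearest neighbor search
        t1 = time.perf_counter()
        feats = fetch_features(candidates)  # embedding / feature fetching
        t2 = time.perf_counter()
        rank(q, feats)                      # DNN ranking stage
        t3 = time.perf_counter()
        for s, dt in zip(stages, (t1 - t0, t2 - t1, t3 - t2)):
            s.samples_ms.append(dt * 1e3)
    # Mean latency per stage reveals which stage bottlenecks the pipeline.
    return {s.name: sum(s.samples_ms) / len(s.samples_ms) for s in stages}
```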

On the MovieLens datasets, ReCoOpt demonstrates that holistic co-design can improve hit rate by 0.05 under a fixed latency budget, double throughput, or reduce latency by 40%, showing that balanced hardware and software optimization is essential for scalable recommendation engines.

Recommendation System Architecture Diagram

OverQ: Overwrite Quantization for Outlier-Aware CNN Acceleration

Overwrite quantization technique for handling activation outliers in CNNs
Thanks to my mentor Jordan Dotzel for his guidance and support on this work.

Low-precision quantization reduces the cost of deep neural networks but suffers from rare activation outliers that degrade accuracy. Existing fixes require retraining or costly dedicated outlier hardware.

OverQ introduces overwrite quantization, which lets outliers reuse nearby zeros to gain extra range or precision. A simple cascading mechanism expands coverage, and the design fits efficiently into systolic array accelerators.
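The core idea can be sketched in a few lines of Python, assuming non-negative (post-ReLU) activations and right-neighbor pairing. This illustrates only the range-extension case, not the cascading mechanism or the systolic-array datapath.

```python
import numpy as np

def overwrite_quantize(acts, bits=4):
    """Illustrative sketch of overwrite quantization (range-extension case),
    not the hardware design. An outlier that saturates the b-bit range
    "overwrites" an adjacent zero lane with its excess magnitude, and both
    lanes are read back to reconstruct the full value.
    """
    qmax = 2 ** bits - 1                         # e.g. 0..15 at 4 bits
    scale = qmax / max(float(acts.max()), 1e-8)
    ideal = np.round(acts * scale)               # unbounded quantized values
    q = np.clip(ideal, 0, qmax)                  # baseline: outliers saturate
    borrowed = np.zeros_like(q)                  # excess parked in zero lanes
    for i in range(len(q) - 1):
        if ideal[i] > qmax and q[i + 1] == 0 and borrowed[i + 1] == 0:
            borrowed[i + 1] = min(ideal[i] - qmax, qmax)
    # Reconstruction: each borrowed lane's excess is added back to its left
    # neighbor; the borrowed lane itself held a zero, so nothing is lost.
    deq = q.copy()
    deq[:-1] += borrowed[1:]
    return deq / scale
```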

With modest overhead, OverQ handles over 90% of outliers, improving ImageNet accuracy by up to 5% at 4 bits while adding only about 10% area per processing element.

OverQ Architecture

Teaching

Course Development: MLSys Teaching Frameworks (Cornell ECE 5545)

Led development of user-friendly PyTorch frameworks for speech recognition model training, fine-tuning, quantization, and deployment. Wrote TVM tutorials and directed TinyML Keyword Spotting deployment on Arduino Nano 33 BLE.

PyTorch ONNX TensorFlow Lite Apache TVM Arduino
MLSys Course Development

Teaching Assistant

Served as Teaching Assistant for the following courses at Cornell:

  • CS 1110: Introduction to Computing
  • CS 4/5780: Introduction to Machine Learning
  • CS 4/5410: Operating Systems
  • ECE 5755: Modern Computer Systems and Architecture

Technical Projects

Custom Compiler for x86-64

Led a team of three to build a complete compiler in Java targeting x86-64 assembly.
Implemented 12.5K lines of code spanning lexical analysis, semantic analysis, and optimization passes.

Java x86-64 Assembly Compilers

Sokoban Game Engine in OCaml

Designed and implemented a GUI-based Sokoban engine in OCaml using the Graphics module, supporting interactive rendering and state management.
Extended with multiplayer synchronization, checkpointing, and rule-based constraints.

OCaml Graphics Module GUI Game Engine
Sokoban Game Demo

RISC-V Multicore Processor in Verilog

Implemented a quad-core RISC-V processor in Verilog with pipelined execution, bypassing, and a variable-latency ALU with an iterative multiplier.
Designed the memory subsystem and a two-way set-associative cache hierarchy.

Verilog RISC-V PyMTL C Cache Design

Publications

FlashDLM: Accelerating Diffusion Language Model Inference via Efficient KV Caching and Guided Diffusion

preprint

Zhanqiu Hu, Jian Meng, Yash Akhauri, Mohamed S. Abdelfattah, Jae-sun Seo, Zhiru Zhang, Udit Gupta

A Full-Stack HW/SW Co-Design Analysis for Recommendation System Inference

Submitted to IEEE Micro

Zhanqiu Hu, Mark Zhao, Zhiru Zhang, Udit Gupta

OverQ: Opportunistic Outlier Quantization for Neural Network Accelerators

preprint

Ritchie Zhao, Jordan Dotzel, Zhanqiu Hu, Preslav Ivanov, Christopher De Sa, Zhiru Zhang