Projects

StreamTensor: Make Tensors Stream in Dataflow Accelerators

Jan. 2024 - Present
Document

  • Designed a dataflow-centric type system and intermediate representation (IR) in MLIR to model kernel processing and communication at the tensor level.

  • Introduced stream-based kernel fusion, on-the-fly memory layout conversion, and dataflow FIFO optimization to reduce off-chip memory access and on-chip memory size.

  • Designed a compilation pipeline that compiles PyTorch models, e.g., large language models (LLMs), to low-level IRs targeting dataflow accelerators, such as AMD Versal ACAP and FPGAs.


XLS: Accelerated HW Synthesis

May 2023 - Aug. 2023
GitHub

  • XLS implements a High-level Synthesis (HLS) toolchain which produces synthesizable designs (Verilog and SystemVerilog) from flexible, high-level descriptions of functionality.

  • Proposed a feedback-directed optimization (FDO) method named ISDC that uses results from downstream tools, e.g., OpenROAD, as feedback to improve the SDC scheduling quality of HLS.

  • Achieved 28.5% lower register usage than the original SDC scheduling on SKY130.


HIDA: A Hierarchical Dataflow Compiler for High-level Synthesis

Mar. 2022 - Jan. 2024
GitHub

  • Proposed a hierarchical dataflow intermediate representation (IR) to model and optimize the complicated dataflow structures in High-level Synthesis (HLS).

  • Designed an algorithm to guide the local design space exploration of each dataflow node while keeping the global dataflow balanced and efficient.


CHARM: A Heterogeneous GEMM Accelerator on Versal ACAP

Dec. 2021 - Oct. 2022
GitHub

  • Mapped GEMM-based models, e.g., BERT and ViT, to accelerators on AMD Versal ACAP; non-GEMM kernels and data movement kernels are implemented on Programmable Logic (PL).

  • Proposed a design space exploration algorithm to determine the tiling strategy at each level of memory.


PolyAIE: A Polyhedral Compiler for Versal ACAP

Oct. 2021 - Apr. 2022
GitHub

  • Designed a compilation flow from C/C++ programs to the AI-Engine (AIE) array on AMD Versal ACAP using polyhedral compilation techniques in MLIR.

CIRCT: Circuit IR Compilers and Tools

Jun. 2020 - Oct. 2021
GitHub

  • The CIRCT open-source project is an effort looking to apply MLIR and the LLVM development methodology to the domain of hardware design tools.

  • Contributed to the FIRRTL, HW (Hardware), and SV (SystemVerilog) dialects and transformations to establish the hardware 'core IR' of CIRCT and enable a Chisel to SystemVerilog compilation flow.

  • Contributed a new FSM dialect to represent, optimize, and generate code for finite-state machines.

  • Contributed to the Handshake and Pipeline dialects to enable a High-level Synthesis (HLS) flow that compiles the 'core IR' of MLIR to the hardware 'core IR' of CIRCT.


ScaleHLS: A Scalable High-level Synthesis Framework on MLIR

Apr. 2020 - Mar. 2022
GitHub

  • Designed a multi-level HLS representation and optimization framework in MLIR.

  • Designed an HLS-specific transform and analysis library, including loop and pragma optimizations, an HLS Quality-of-Result (QoR) estimator, and a multi-objective design space explorer.

  • Designed a C/C++ front-end and an HLS C/C++ code generator for MLIR.


DNNExplorer: A Novel Design Paradigm of DNN Accelerator

Feb. 2020 - Mar. 2021

  • Proposed a novel DNN acceleration paradigm that takes advantage of both dataflow pipeline and overlay architectures, enabling a more scalable solution than prior art.

  • Proposed an efficient design space exploration algorithm to generate optimized DNN accelerators following the new paradigm.


HybridDNN: Hybrid Spatial and Winograd DNN Accelerator

Jan. 2019 - Dec. 2019

  • Proposed a hybrid Spatial and Winograd convolution architecture for DNN acceleration.

  • Designed a comprehensive tool for performance and area estimation and for design space exploration on both edge and cloud FPGAs.


Musket: RISCV-based IoT Sensor-Hub on FPGA

Apr. 2018 - Aug. 2018

  • Pruned and ported a RISC-V core to an edge FPGA and built a low-power SoC around it.

  • Ported an RTOS to manage sensors and the wireless connection between FPGA and smartphones.

  • Won the outstanding award at the 2nd China College IC Competition.


RS-Pipeline: Dynamic and Pipelined CNN Accelerator on FPGA

Oct. 2017 - May 2018

  • Proposed a Dynamic Partial Reconfiguration (DPR)-based pipeline architecture to deploy large CNN accelerators on resource-limited FPGAs while maintaining a low overall latency.