BitbyBit: Cycle-Accurate Custom GPU Silicon for Edge Transformer Inference

A ground-up Verilog-2005 GPU architecture featuring zero-skip MAC units and hardware-page-allocated KV caches for edge LLM acceleration

About this build

BitbyBit is a cycle-accurate, ground-up custom silicon GPU architecture written in Verilog-2005 specifically for edge Transformer inference (GPT-2 and Gemma-style decoder models). Designed with zero third-party IP cores, the architecture achieves a verified 3.2x speedup in cycle-accurate simulation by moving from a baseline model path to a hardware-optimized "Imprint Path" featuring silicon-imprinted weight matrices.

Key Architectural Highlights:

1. Core Compute: A variable-precision ALU (supporting 4-bit, 8-bit, or 16-bit parallel operations) combined with fused dequantization (INT4-to-INT8) and custom 256-entry lookup tables (GELU, Softmax, Inverse Sqrt) for zero-latency in-pipeline mathematical approximations.
2. Zero-Skip Optimization: A hardware bypass in the multiply-accumulate (MAC) units and the N-by-N systolic array that skips the computation and gates the clock when a weight value is zero, minimizing dynamic power dissipation.
3. Multi-Banked Scratchpad SRAM: An 8-banked, 4KB scratchpad SRAM that uses dynamic bank interleaving, padding, and row-hit-aware translation to eliminate bank conflicts during simultaneous matrix-vector projections.
4. Hardware Paged KV Cache: Virtualized KV cache management implemented with hardware stack-based page allocators and hardware page tables, optimizing virtual context windows without off-chip latency penalties.
5. AXI4 / DMA Control Subsystem: Standard AXI4-Lite command processors for host control register mapping, coupled with high-bandwidth AXI4 Master DMA engines that support linear and 2D-strided weight loading.

Physical & Simulation Specs (RTL Target: TSMC 28nm HPC+):

- Target Fmax: 300 MHz
- Area & Complexity: ~8 mm², 2.5M gates
- Cycle Efficiency: Mean execution latency reduced from 358 cycles (Baseline) to 112 cycles (Imprint Path), a 68.7% cycle reduction.
- Throughput: Up to 892,857 tokens/sec (Imprint Path) and up to 2.67M tokens/sec with Medusa-style heads (both measured at 100 MHz).
- Validation: Exhaustively validated against bit-exact Python/NumPy golden models, with complete testbench coverage and SVA properties that prove the pipeline is deadlock-free.
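The cycle-efficiency and throughput figures above follow from the two latency numbers and the 100 MHz measurement clock. The short check below reproduces them; the only assumption is that one token completes every 112 cycles on the Imprint Path, which is what the quoted throughput implies.

```python
# Figures quoted in the specs above.
BASELINE_CYCLES = 358        # mean latency, baseline path
IMPRINT_CYCLES = 112         # mean latency, Imprint Path
CLOCK_HZ = 100_000_000       # 100 MHz measurement clock

# Speedup and relative cycle reduction between the two paths.
speedup = BASELINE_CYCLES / IMPRINT_CYCLES
reduction_pct = 100 * (BASELINE_CYCLES - IMPRINT_CYCLES) / BASELINE_CYCLES

# Assumption: one token completes every IMPRINT_CYCLES cycles.
tokens_per_sec = CLOCK_HZ / IMPRINT_CYCLES

print(f"speedup: {speedup:.2f}x")                 # speedup: 3.20x
print(f"cycle reduction: {reduction_pct:.1f}%")   # cycle reduction: 68.7%
print(f"tokens/sec: {tokens_per_sec:,.0f}")       # tokens/sec: 892,857
```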
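To make the zero-skip idea concrete, here is a minimal behavioral sketch in Python, in the spirit of a golden model. It is not the project's RTL or its actual reference model: the function name `zero_skip_dot` and the counters are hypothetical, and the sketch only counts how many multiplies would be bypassed for a given weight vector.

```python
def zero_skip_dot(weights, activations):
    """Dot product that bypasses the multiply whenever the weight is zero.

    Returns (accumulator, multiplies_issued, multiplies_skipped).
    Hypothetical behavioral model; it does not model timing or clock gating.
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")
    acc = 0
    issued = 0
    skipped = 0
    for w, a in zip(weights, activations):
        if w == 0:
            # A zero weight contributes nothing, so the multiplier stays idle.
            skipped += 1
            continue
        acc += w * a
        issued += 1
    return acc, issued, skipped


weights = [3, 0, -2, 0, 0, 5, 0, 1]
activations = [7, 9, 4, -6, 2, -1, 8, 3]

acc, issued, skipped = zero_skip_dot(weights, activations)
dense = sum(w * a for w, a in zip(weights, activations))

print(acc, issued, skipped)   # 11 4 4
print(acc == dense)           # True
```

The result matches the dense dot product exactly, which is the property a bit-exact comparison would rely on. Only the number of active multiplies changes.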
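The effect of padding on bank conflicts can be illustrated with a small address-mapping sketch. The bank count of 8 comes from the spec above. The low-order word interleaving (`bank = address mod 8`) and the one-word pad are assumptions chosen for illustration, and the design's dynamic, row-hit-aware translation is not modeled here.

```python
NUM_BANKS = 8  # from the spec: 8-banked scratchpad


def bank_of(word_addr):
    # Assumption: consecutive words map to consecutive banks.
    return word_addr % NUM_BANKS


def column_banks(row_stride, col, rows=NUM_BANKS):
    """Banks touched when reading one column of a row-major matrix."""
    return [bank_of(r * row_stride + col) for r in range(rows)]


def max_conflicts(banks):
    """Largest number of accesses that land on a single bank."""
    return max(banks.count(b) for b in set(banks))


# Row stride equal to the bank count: every column element hits the same bank.
unpadded = column_banks(row_stride=8, col=0)
# One word of padding per row spreads the column across all banks.
padded = column_banks(row_stride=9, col=0)

print(unpadded, max_conflicts(unpadded))  # [0, 0, 0, 0, 0, 0, 0, 0] 8
print(padded, max_conflicts(padded))      # [0, 1, 2, 3, 4, 5, 6, 7] 1
```

With the padded stride, eight column reads can in principle be served in parallel, one per bank, at the cost of one unused word per row.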
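The paged KV cache combines a stack of free pages with a per-sequence page table. The sketch below shows one way those two structures can interact in software. The class name `PagedKVCache`, its methods, and the page sizes are hypothetical and chosen for illustration; the actual hardware allocator and page-table layout may differ.

```python
class PagedKVCache:
    """Toy model: a free-page stack plus a per-sequence page table."""

    def __init__(self, num_pages, page_size):
        self.page_size = page_size
        # Stack of free physical pages; pop() hands out page 0 first.
        self.free_stack = list(range(num_pages - 1, -1, -1))
        self.page_table = {}   # seq_id -> list of physical page numbers
        self.lengths = {}      # seq_id -> number of tokens stored

    def append_token(self, seq_id):
        """Reserve a slot for one more token; return (physical_page, offset)."""
        n = self.lengths.get(seq_id, 0)
        pages = self.page_table.setdefault(seq_id, [])
        if n % self.page_size == 0:
            # The current page is full (or none exists yet): pop a fresh one.
            if not self.free_stack:
                raise MemoryError("no free KV pages")
            pages.append(self.free_stack.pop())
        self.lengths[seq_id] = n + 1
        return pages[n // self.page_size], n % self.page_size

    def translate(self, seq_id, token_idx):
        """Map a virtual token index to (physical_page, offset)."""
        if not 0 <= token_idx < self.lengths.get(seq_id, 0):
            raise IndexError("token index out of range")
        pages = self.page_table[seq_id]
        return pages[token_idx // self.page_size], token_idx % self.page_size

    def release(self, seq_id):
        """Push all pages of a finished sequence back onto the free stack."""
        for page in reversed(self.page_table.pop(seq_id, [])):
            self.free_stack.append(page)
        self.lengths.pop(seq_id, None)


cache = PagedKVCache(num_pages=4, page_size=2)
slots = [cache.append_token(0) for _ in range(3)]
other = cache.append_token(1)

print(slots)                  # [(0, 0), (0, 1), (1, 0)]
print(other)                  # (2, 0)
print(cache.translate(0, 2))  # (1, 0)

cache.release(0)
print(cache.append_token(2))  # (0, 0)
```

Allocation and release are both constant-time stack operations, which is one reason a stack-based free list is a natural fit for a hardware allocator.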

Builder

Built with

  • Verilog-2005
  • SystemVerilog
  • Icarus Verilog (iverilog)
  • Python
  • NumPy
  • Next.js
  • Tailwind CSS
  • Framer Motion
  • TypeScript