BitbyBit: Cycle-Accurate Custom GPU Silicon for Edge Transformer Inference
A ground-up Verilog-2005 GPU architecture featuring zero-skip MAC units and hardware-page-allocated KV caches for edge LLM acceleration

About this build
BitbyBit is a cycle-accurate, ground-up custom silicon GPU architecture written in Verilog-2005 for edge Transformer inference (GPT-2 and Gemma-style decoder models). It uses no third-party IP cores. In cycle-accurate simulation, moving from the baseline model path to the hardware-optimized "Imprint Path", which uses silicon-imprinted weight matrices, yields a verified 3.2x speedup.

Key Architectural Highlights:
1. Core Compute: A variable-precision ALU (4-bit, 8-bit, or 16-bit parallel operations) combined with fused dequantization (INT4-to-INT8) and custom 256-entry lookup tables (GELU, Softmax, Inverse Sqrt) for zero-latency, in-pipeline mathematical approximations.
2. Zero-Skip Optimization: A hardware bypass in the multiply-accumulate (MAC) units and the N-by-N systolic array skips the compute and gates the clock when a weight value is zero, minimizing dynamic power dissipation.
3. Multi-Banked Scratchpad SRAM: An 8-banked, 4KB scratchpad SRAM that uses dynamic bank interleaving, padding, and row-hit-aware translation to eliminate bank conflicts during simultaneous matrix-vector projections.
4. Hardware Paged KV Cache: Virtualized KV cache management implemented with hardware stack-page allocators and hardware page tables, so virtual context windows are managed without off-chip latency penalties.
5. AXI4 / DMA Control Subsystem: Standard AXI4-Lite command processors for host control-register mapping, coupled with high-bandwidth AXI4 Master DMA engines that support linear and 2D-stride weight loading.

Physical & Simulation Specs (RTL Target: TSMC 28nm HPC+):
- Target Fmax: 300 MHz
- Area & Complexity: ~8 mm², 2.5M gates
- Cycle Efficiency: Mean execution latency reduced from 358 cycles (Baseline) to 112 cycles (Imprint Path), a 68.7% cycle reduction.
- Throughput (measured at 100 MHz): Up to 892,857 tokens/sec (Imprint Path) and up to 2.67M tokens/sec with Medusa-style heads.
- Validation: Exhaustively validated against bit-exact Python/NumPy golden models, with complete testbench coverage and SVA properties that prove the pipeline is deadlock-free.
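
The cycle and throughput figures quoted in the specs follow from one another, and a short Python check makes the arithmetic explicit. This is a sketch, not project code: the constant and function names are hypothetical, and it assumes one token is produced per 112-cycle Imprint Path pass at the 100 MHz measurement clock. The Medusa-style figure is not modeled.

```python
# Figures taken from the specs above; names are illustrative only.
BASELINE_CYCLES = 358      # mean latency, baseline path
IMPRINT_CYCLES = 112       # mean latency, Imprint Path
CLOCK_HZ = 100_000_000     # measurement clock (100 MHz)


def speedup(baseline: int, optimized: int) -> float:
    """Ratio of baseline cycles to optimized cycles."""
    return baseline / optimized


def cycle_reduction_pct(baseline: int, optimized: int) -> float:
    """Percentage of baseline cycles removed by the optimized path."""
    return 100.0 * (baseline - optimized) / baseline


def tokens_per_sec(clock_hz: int, cycles_per_token: int) -> float:
    """Throughput, assuming one token per pass of cycles_per_token cycles."""
    return clock_hz / cycles_per_token


if __name__ == "__main__":
    print(f"speedup:    {speedup(BASELINE_CYCLES, IMPRINT_CYCLES):.2f}x")
    print(f"reduction:  {cycle_reduction_pct(BASELINE_CYCLES, IMPRINT_CYCLES):.1f}%")
    print(f"throughput: {tokens_per_sec(CLOCK_HZ, IMPRINT_CYCLES):,.0f} tokens/sec")
```

Under these assumptions the 3.2x speedup, the 68.7% reduction, and the 892,857 tokens/sec figure are all consistent with the 358-cycle and 112-cycle latencies.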
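
The zero-skip behavior listed in the architectural highlights can be illustrated with a small behavioral model in the spirit of a golden reference. This is a minimal sketch, not the project's RTL or its actual golden model: the function name `zero_skip_dot` is hypothetical, and it only counts which MAC operations are exercised, not the real cycle timing or clock-gating behavior of the hardware.

```python
def zero_skip_dot(weights, activations):
    """Behavioral model of a zero-skip MAC over one dot product.

    Returns (accumulator, active_macs, skipped_macs). A zero weight
    contributes nothing to the sum, so the multiply is bypassed and
    counted as skipped instead of being executed.
    """
    if len(weights) != len(activations):
        raise ValueError("weights and activations must have the same length")

    acc = 0
    active = 0
    skipped = 0
    for w, a in zip(weights, activations):
        if w == 0:
            skipped += 1      # bypass: multiplier is not exercised
            continue
        acc += w * a          # normal multiply-accumulate
        active += 1
    return acc, active, skipped


if __name__ == "__main__":
    acc, active, skipped = zero_skip_dot([0, 2, 0, -3], [5, 7, 9, 1])
    print(acc, active, skipped)
```

Because a zero weight never changes the accumulator, the skipped result is bit-identical to the unskipped one, which is what lets a bypass like this be checked against a plain dot product.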
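
The paged KV cache idea from the highlights (a stack of free pages plus a page table) can also be sketched in software. The model below is an assumption-laden illustration: the class name `PagedKVCache`, the page count, the page size, and the single-sequence page table are all hypothetical and are not taken from the BitbyBit RTL.

```python
class PagedKVCache:
    """Toy model of a stack-based page allocator with a page table."""

    def __init__(self, num_pages: int, page_size: int):
        self.page_size = page_size
        # Free physical pages kept as a LIFO stack; page 0 ends up on top.
        self.free_stack = list(range(num_pages - 1, -1, -1))
        # Maps virtual page number -> physical page number.
        self.page_table = {}

    def translate(self, token_index: int):
        """Return (physical_page, offset), allocating a page on first touch."""
        vpage, offset = divmod(token_index, self.page_size)
        if vpage not in self.page_table:
            if not self.free_stack:
                raise MemoryError("KV cache is out of physical pages")
            self.page_table[vpage] = self.free_stack.pop()
        return self.page_table[vpage], offset

    def release(self):
        """Push every mapped page back onto the free stack."""
        for ppage in self.page_table.values():
            self.free_stack.append(ppage)
        self.page_table.clear()


if __name__ == "__main__":
    cache = PagedKVCache(num_pages=4, page_size=16)
    print(cache.translate(0))    # first token of virtual page 0
    print(cache.translate(17))   # second token of virtual page 1
    cache.release()
```

Allocation and release are both constant-time push and pop operations here, which is the property that makes a stack allocator straightforward to implement in hardware.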
Built with
- Verilog-2005
- SystemVerilog
- Icarus Verilog (iverilog)
- Python
- NumPy
- Next.js
- Tailwind CSS
- Framer Motion
- TypeScript