Energy-efficient deep neural networks using Domain Wall Memory cache for general-purpose graphics processing units
Abstract
Deep neural networks (DNNs) have become the dominant computational paradigm across
computer vision, natural language processing, and generative modelling, yet achieving
state-of-the-art accuracy increasingly requires models with billions of parameters and commensurately
large memory footprints. These models place extreme bandwidth and capacity
demands on the on-chip memory hierarchy of modern general-purpose graphics processing
units (GPGPUs), making the shared L2 cache a major contributor to energy consumption.
At the same time, aggressive SRAM scaling leads to rapidly increasing leakage power,
presenting a fundamental challenge for future high-performance computing architectures.
Domain Wall Memory (DWM) is a promising alternative for large on-chip caches due to its
ultra-high density and near-zero leakage, but its shift-based access mechanism introduces
variable and often high access latency that must be addressed before practical deployment.
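The variable latency arises because data in a DWM track must be physically shifted under a fixed access head before it can be read or written. A minimal sketch of this cost model (parameters such as `shift_cycles` are illustrative assumptions, not values from the thesis):

```python
# Minimal model of DWM shift overhead: each access shifts the track so
# the requested domain aligns with the head, costing cycles proportional
# to the shift distance. Parameter values are illustrative assumptions.
class DWMTrack:
    def __init__(self, shift_cycles=1):
        self.shift_cycles = shift_cycles
        self.head_pos = 0  # domain currently aligned with the access head

    def access(self, domain):
        """Shift `domain` under the head; return the shift latency in cycles."""
        shifts = abs(domain - self.head_pos)
        self.head_pos = domain
        return shifts * self.shift_cycles

track = DWMTrack()
# Sequential accesses cost at most one shift each; a jump is much costlier.
latencies = [track.access(d) for d in (0, 1, 2, 50)]
print(latencies)  # [0, 1, 1, 48]
```

The spread between the sequential accesses and the long jump is exactly the non-uniformity that head-position prediction tries to hide.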
This thesis presents a hardware–software co-design framework that integrates a DWM-based
L2 cache into tensor core (TC)-equipped GPGPUs while mitigating DWM’s shift
penalty. On the hardware side, the conventional SRAM data array is replaced with DWM, and tape-head prediction policies proactively reposition track heads according to predicted access patterns. A hybrid predictor that combines stride prediction with two-level context-based prediction achieves the lowest shift overhead among all evaluated strategies. On the
software side, structured pruning is applied to representative CNN and transformer models
to reduce parameter count and regularize memory accesses, and TC-optimized kernels are
implemented that efficiently exploit the pruned structures.
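The hybrid head-positioning idea described above can be sketched as follows; this is an illustrative combination of a stride predictor with a two-level, history-indexed table, with all class and field names invented here rather than taken from the thesis:

```python
# Hedged sketch of a hybrid tape-head predictor: a two-level predictor
# (recent-access history indexing a pattern table) backed by a simple
# stride predictor. Names and table sizes are illustrative assumptions.
class HybridPredictor:
    def __init__(self, history_len=2):
        self.last = None
        self.stride = 0
        self.history = ()        # first level: recent access positions
        self.pattern_table = {}  # second level: history -> next position
        self.history_len = history_len

    def predict(self):
        """Predict the next accessed domain so the head can be pre-shifted."""
        # Prefer the context-based prediction when this pattern was seen before.
        if self.history in self.pattern_table:
            return self.pattern_table[self.history]
        if self.last is not None:
            return self.last + self.stride  # fall back to stride prediction
        return 0

    def update(self, pos):
        """Record the actual access and train both predictors."""
        if self.last is not None:
            self.stride = pos - self.last
            self.pattern_table[self.history] = pos
        self.history = (self.history + (pos,))[-self.history_len:]
        self.last = pos

p = HybridPredictor()
for pos in (0, 4, 8, 12):  # regular stride-4 access stream
    p.update(pos)
print(p.predict())  # 16: the stride component extrapolates the stream
```

On irregular but repeating sequences, the pattern table takes over from the stride fallback, which is the intended complementarity of the two components.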
Across a suite of seven convolutional and attention-based DNN models, pruned DWM-based L2 caches achieve an average energy saving of 73.2% compared to an unpruned SRAM-based L2 cache, while delivering an average performance improvement of 13.5% and avoiding performance degradation on every evaluated model. Under
iso-area conditions, the DWM-based L2 cache achieves 17× more capacity than SRAM, enabling it to outperform SRAM by 7% to 37.4% in execution time and reduce energy
consumption by 53.3% to 71.6%. The resulting Energy–Delay Product (EDP) of SRAM
is 2.3× to 4.58× higher than that of DWM. These results demonstrate that carefully co-optimizing emerging non-volatile memories at both software and hardware levels can deliver
energy-efficient DNN acceleration without sacrificing performance.
Description
Thesis embargoed until April 23, 2027.
