Energy-efficient deep neural networks using Domainwall Memory cache for general-purpose graphics processing units

Authors

Namvarmotlagh, Alireza

Abstract

Deep neural networks (DNNs) have become the dominant computational paradigm across computer vision, natural language processing, and generative modelling, yet achieving state-of-the-art accuracy increasingly requires models with billions of parameters and commensurately large memory footprints. These models place extreme bandwidth and capacity demands on the on-chip memory hierarchy of modern general-purpose graphics processing units (GPGPUs), making the shared L2 cache a major contributor to energy consumption. At the same time, aggressive SRAM scaling leads to rapidly increasing leakage power, presenting a fundamental challenge for future high-performance computing architectures. Domain Wall Memory (DWM) is a promising alternative for large on-chip caches due to its ultra-high density and near-zero leakage, but its shift-based access mechanism introduces variable and often high access latency that must be addressed before practical deployment. This thesis presents a hardware–software co-design framework that integrates a DWM-based L2 cache into tensor core (TC)-equipped GPGPUs while mitigating DWM's shift penalty. On the hardware side, the conventional SRAM data array is replaced with DWM, and tape-head prediction policies proactively reposition track heads based on predicted access patterns. A hybrid predictor combining stride and two-level context-based prediction achieves the lowest shift overhead among all evaluated strategies. On the software side, structured pruning is applied to representative CNN and transformer models to reduce parameter count and regularize memory accesses, and TC-optimized kernels are implemented to efficiently exploit the pruned structures.
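The hybrid tape-head predictor described above can be sketched in a few lines. This is an illustrative model only, assuming a simple last-stride predictor, a two-level predictor that indexes a pattern table with recent position deltas, and a saturating counter to arbitrate between the two; all class and parameter names here are hypothetical and not the thesis's actual microarchitecture.

```python
# Illustrative sketch of a hybrid tape-head predictor for a DWM cache.
# A stride predictor and a two-level context-based predictor each guess the
# next track position; a saturating counter selects whichever has been more
# accurate recently. Names and structures are assumptions for illustration.

class StridePredictor:
    def __init__(self):
        self.last = 0
        self.stride = 0

    def predict(self):
        return self.last + self.stride

    def update(self, pos):
        self.stride = pos - self.last
        self.last = pos


class ContextPredictor:
    """Two-level predictor: recent position deltas index a pattern table."""
    def __init__(self, history_len=2):
        self.history = ()
        self.table = {}          # delta history -> predicted next delta
        self.last = 0
        self.history_len = history_len

    def predict(self):
        delta = self.table.get(self.history, 0)
        return self.last + delta

    def update(self, pos):
        delta = pos - self.last
        self.table[self.history] = delta
        self.history = (self.history + (delta,))[-self.history_len:]
        self.last = pos


class HybridPredictor:
    def __init__(self):
        self.stride = StridePredictor()
        self.context = ContextPredictor()
        self.counter = 0         # >0 favours stride, <=0 favours context

    def predict(self):
        return (self.stride.predict() if self.counter > 0
                else self.context.predict())

    def update(self, pos):
        s_ok = self.stride.predict() == pos
        c_ok = self.context.predict() == pos
        if s_ok and not c_ok:
            self.counter = min(self.counter + 1, 3)
        elif c_ok and not s_ok:
            self.counter = max(self.counter - 1, -3)
        self.stride.update(pos)
        self.context.update(pos)

# The cache controller would move the head to predict() ahead of the next
# access, paying |actual - predicted| shifts instead of |actual - head|.
```

On a regular strided access pattern the stride side quickly dominates, while the context table captures short repeating irregular patterns; the saturating counter is one common way to arbitrate, though the thesis may use a different selection mechanism.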
Across a suite of seven convolutional and attention-based DNN models, pruned DWM-based L2 caches achieve an average energy saving of 73.2% compared to an unpruned SRAM-based L2 cache, while delivering an average performance improvement of 13.5%, avoiding performance degradation on all evaluated models. Under iso-area conditions, the DWM-based L2 cache provides 17× more capacity than SRAM, enabling it to outperform SRAM by 7% to 37.4% in execution time and reduce energy consumption by 53.3% to 71.6%. The resulting Energy–Delay Product (EDP) of SRAM is 2.3× to 4.58× higher than that of DWM. These results demonstrate that carefully co-optimizing emerging non-volatile memories at both software and hardware levels can deliver energy-efficient DNN acceleration without sacrificing performance.
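As a quick consistency check on the figures above: EDP is energy multiplied by delay, so DWM's relative EDP follows directly from its relative energy and execution time. The sketch below pairs the lower ends of the two reported ranges, which reproduces the reported 2.3× SRAM-to-DWM EDP ratio; pairing the extremes this way is an assumption, since the abstract does not state which benchmark produces which value.

```python
# EDP = energy * delay. Derive the SRAM-to-DWM EDP ratio from DWM's
# relative savings. Pairing the lower ends of both reported ranges
# (53.3% energy reduction, 7% execution-time improvement) is an
# assumption made here for illustration.

def edp_ratio(energy_reduction, time_improvement):
    """SRAM EDP divided by DWM EDP, given DWM's fractional savings."""
    dwm_energy = 1.0 - energy_reduction   # DWM energy relative to SRAM
    dwm_time = 1.0 - time_improvement     # DWM delay relative to SRAM
    return 1.0 / (dwm_energy * dwm_time)

print(round(edp_ratio(0.533, 0.07), 2))   # prints 2.3
```

The upper ends of the two ranges need not come from the same workload, which is why they do not jointly reproduce the 4.58× upper bound.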

Description

Thesis embargoed until April 23, 2027.
