Learning Performant and Efficient Representations Suitable for Hardware Acceleration
The challenge in training sparse neural networks is to achieve both high accuracy and practical hardware acceleration. Unstructured sparsity often yields good performance but is hard to speed up, while traditional structured sparsity can hurt performance.
Our International Conference on Learning Representations (ICLR) 2024 paper, “Dynamic Sparse Training with Structured Sparsity”, tackles this tension head-on.
State-of-the-art deep neural networks (DNNs) have achieved remarkable feats, but their ever-increasing size brings ballooning training costs, often outstripping Moore’s Law. This trend makes cutting-edge AI research less accessible. While techniques exist to prune trained dense models, effectively reducing their parameter count by 85-95% without sacrificing generalization, they rely on dense pre-training. Can we train these sparse, efficient networks without dense training?
Several approaches have been developed to tackle the challenge of obtaining sparse networks; the main ones are outlined below.
Standard pruning techniques involve training a full, dense network and then removing (pruning) weights deemed less important, typically those with the smallest magnitude. This can be done once (“one-shot pruning”) or iteratively. While effective at finding highly sparse subnetworks that retain accuracy, this still necessitates the expensive initial dense training.
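To make this concrete, here is a minimal sketch of one-shot global magnitude pruning in PyTorch; the toy two-layer model and the 90% sparsity target are illustrative choices, not a setup from the paper.

```python
import torch
import torch.nn as nn

def magnitude_prune(model: nn.Module, sparsity: float = 0.9) -> None:
    """One-shot global magnitude pruning: zero out the smallest-magnitude
    weights across all linear/conv layers until `sparsity` of them are zero."""
    weights = [m.weight for m in model.modules()
               if isinstance(m, (nn.Linear, nn.Conv2d))]
    all_scores = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * all_scores.numel())
    threshold = torch.kthvalue(all_scores, k).values  # k-th smallest magnitude
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).float())  # apply the binary mask in place

# Toy usage; in practice the pruned model is then fine-tuned (or pruned iteratively).
model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
magnitude_prune(model, sparsity=0.9)
```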
Despite the success of pruning, simply training a sparse network from a random initialization, even with a known “good” sparse mask (the pattern of zeroed-out weights), often leads to poor performance compared to its dense counterpart or a pruned dense model. This is known as the sparse training problem.
The Lottery Ticket Hypothesis (LTH) posits that dense networks contain sparse subnetworks (“winning tickets”) which, when reset to their original initialization, can be trained in isolation to match the accuracy of the full network. However, finding these subnetworks via iterative magnitude pruning still requires one or more rounds of costly dense training.
Dynamic Sparse Training (DST) methods offer a more direct approach to training sparse networks. Techniques like Sparse Evolutionary Training (SET) and RigL keep the network sparse throughout training and periodically update its connectivity: low-magnitude weights are dropped and an equal number of new connections are grown, either at random (SET) or where the dense-gradient magnitudes are largest (RigL), so the parameter budget stays fixed. In this way, DST can approach the generalization of pruned dense models without ever training a dense network.
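A minimal single-layer sketch of this prune-and-grow step (RigL-style, with magnitude-based dropping and gradient-based regrowth) might look as follows; it is a simplification of the published algorithm, which also anneals the update fraction and schedules updates over training.

```python
import torch

def rigl_update(weight: torch.Tensor, grad: torch.Tensor,
                mask: torch.Tensor, drop_frac: float = 0.3) -> torch.Tensor:
    """One RigL-style connectivity update for a single layer (sketch).

    Drops the smallest-magnitude active weights and regrows the same number of
    inactive connections where the dense gradient magnitude is largest, so the
    total parameter count stays fixed. (RigL also anneals `drop_frac` over
    training and initializes newly grown weights to zero; omitted here.)
    """
    w, g, m = weight.flatten(), grad.flatten(), mask.flatten().clone()
    n_update = int(drop_frac * int(m.sum()))

    # Drop: among active weights, zero out the smallest magnitudes.
    drop_scores = w.abs().masked_fill(m == 0, float("inf"))
    m[torch.topk(drop_scores, n_update, largest=False).indices] = 0

    # Grow: among inactive weights, activate the largest-gradient connections.
    grow_scores = g.abs().masked_fill(m == 1, float("-inf"))
    m[torch.topk(grow_scores, n_update, largest=True).indices] = 1

    return m.view_as(mask)
```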
A significant challenge with many DST methods like RigL is that they typically produce unstructured sparsity: individual weights are zeroed out irregularly across the weight matrices, a pattern that commodity CPUs and GPUs struggle to exploit for real speedups.
In contrast, structured sparsity involves removing entire blocks of weights, such as channels, filters, or even neurons.
N:M fine-grained sparsity is a compromise where, within small contiguous blocks of M weights, exactly N weights are non-zero. NVIDIA’s Ampere GPUs support 2:4 sparsity in their Sparse Tensor Cores, offering up to a 2× speedup on matrix multiplications.
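For intuition, the snippet below enforces a 2:4 pattern by keeping the two largest-magnitude weights in every block of four along the input dimension. Note that actually realizing the hardware speedup additionally requires the GPU’s 2:4 sparse kernels (exposed, for example, through PyTorch’s semi-structured sparse tensors or cuSPARSELt), which this sketch does not touch.

```python
import torch

def enforce_2_to_4(weight: torch.Tensor) -> torch.Tensor:
    """Keep the 2 largest-magnitude weights in each contiguous block of 4
    along the input dimension, zeroing the rest (2:4 fine-grained sparsity)."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "input dim must be divisible by the block size"
    blocks = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the 2 largest-magnitude entries in each block of 4.
    keep = torch.topk(blocks.abs(), k=2, dim=2).indices
    mask = torch.zeros_like(blocks).scatter_(2, keep, 1.0)
    return (blocks * mask).reshape(out_features, in_features)

w = torch.randn(8, 16)
w_24 = enforce_2_to_4(w)
assert (w_24.reshape(8, -1, 4) != 0).sum(dim=-1).max() <= 2  # at most 2 non-zeros per block
```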
The ideal scenario is to combine the high accuracy of unstructured DST with the hardware-friendliness of fine-grained structured sparsity.
Our work “Dynamic Sparse Training with Structured Sparsity” introduces Structured RigL (SRigL), a DST method that learns a sparse mask with fine-grained structure amenable to acceleration, while matching the generalization of unstructured DST.
SRigL modifies the RigL algorithm to enforce a constant fan-in constraint. This means each neuron (or output channel in a convolutional layer) has the same number of active incoming connections. This is a specific type of N:M sparsity (where N is the fan-in and M is the potential dense fan-in) and results in a regular structure within the weight matrices. Theoretical analysis suggests that this constant fan-in constraint should not inherently impair training dynamics and might even offer slightly better output-norm variance compared to less constrained sparsity patterns, especially for very sparse networks.
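To make the constraint concrete, the sketch below builds a constant fan-in mask for a linear layer by keeping, for each output neuron (row), the k largest-magnitude incoming weights; the magnitude criterion and the specific k are illustrative assumptions rather than the exact procedure used during SRigL training.

```python
import torch

def constant_fan_in_mask(weight: torch.Tensor, fan_in: int) -> torch.Tensor:
    """Return a binary mask keeping exactly `fan_in` weights per output neuron.

    `weight` has shape (out_features, in_features); each row ends up with the
    same number of active incoming connections (a form of N:M sparsity with
    N = fan_in and M = in_features).
    """
    keep = torch.topk(weight.abs(), k=fan_in, dim=1).indices
    mask = torch.zeros_like(weight)
    mask.scatter_(1, keep, 1.0)
    return mask

w = torch.randn(128, 512)
mask = constant_fan_in_mask(w, fan_in=51)   # ~90% sparsity
assert (mask.sum(dim=1) == 51).all()        # every neuron has the same fan-in
```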
A key empirical finding was that standard unstructured DST methods, like RigL, when pushed to very high sparsity levels (>90%), implicitly learn to ablate neurons; that is, they learn to remove neurons that retain very few weights, effectively reducing the width of layers.
This neuron ablation appears crucial for maintaining generalization at extreme sparsities, but enforcing a naive constant fan-in constraint would prevent this, as it would force every neuron to maintain the same number of weights, even if those weights are not useful for learning.
SRigL integrates the constant fan-in objective with an explicit neuron ablation mechanism on top of RigL’s periodic connectivity updates. At each mask update, low-magnitude weights are dropped and new connections are grown where the dense-gradient magnitudes are largest, as in RigL, but drops and regrowths are constrained so that every active neuron keeps the same fan-in; in addition, neurons whose fraction of salient weights falls below a threshold $\gamma_{sal}$ are ablated entirely.
This allows SRigL to learn both fine-grained constant fan-in sparsity within active neurons and coarser neuron-level structured sparsity.
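The sketch below illustrates the flavour of one such update for a single linear layer. It is a simplified, assumption-laden rendering rather than the paper’s algorithm: masks are 0/1 float tensors, the layer is assumed to start with a constant fan-in and no ablated neurons, and the saliency criterion used for the ablation decision is passed in abstractly rather than implemented.

```python
import torch

def srigl_layer_update(weight: torch.Tensor, grad: torch.Tensor,
                       mask: torch.Tensor, salient: torch.Tensor,
                       gamma_sal: float = 0.3, drop_frac: float = 0.3):
    """One SRigL-flavoured update for a linear layer (illustrative sketch).

    `mask` is a 0/1 float tensor assumed to have a constant fan-in per row
    (output neuron). `salient` is a boolean tensor marking which connections
    count as salient for the ablation decision; its exact definition follows
    the paper and is left abstract here. Per row: drop the smallest-magnitude
    active weights, regrow the same number of inactive connections with the
    largest gradient magnitude (keeping the fan-in constant), then ablate any
    neuron whose fraction of salient connections falls below `gamma_sal`.
    """
    fan_in = int(mask[0].sum())
    n_update = int(drop_frac * fan_in)
    new_mask = mask.clone()

    # Drop: per row, remove the smallest-magnitude active weights.
    drop_scores = weight.abs().masked_fill(mask == 0, float("inf"))
    drop_idx = torch.topk(drop_scores, n_update, dim=1, largest=False).indices
    new_mask.scatter_(1, drop_idx, 0.0)

    # Grow: per row, activate the inactive connections with the largest
    # gradient magnitudes, restoring exactly `fan_in` connections per neuron.
    grow_scores = grad.abs().masked_fill(new_mask == 1, float("-inf"))
    grow_idx = torch.topk(grow_scores, n_update, dim=1, largest=True).indices
    new_mask.scatter_(1, grow_idx, 1.0)

    # Ablate: remove all connections of neurons with too few salient weights.
    ablated = (salient & new_mask.bool()).sum(dim=1) < gamma_sal * fan_in
    new_mask[ablated] = 0.0
    return new_mask, ablated

# Toy usage with a placeholder saliency criterion (illustrative only).
w, g = torch.randn(64, 256), torch.randn(64, 256)
mask = torch.zeros_like(w).scatter_(1, torch.topk(w.abs(), 26, dim=1).indices, 1.0)
salient = g.abs() > g.abs().median()        # stand-in, NOT the paper's criterion
new_mask, ablated = srigl_layer_update(w, g, mask, salient)
assert ((new_mask.sum(dim=1) == 26) | ablated).all()  # constant fan-in or ablated
```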
SRigL was evaluated on image classification tasks using CIFAR-10 (ResNet-18, Wide ResNet-22) and ImageNet (ResNet-50, MobileNet-V3, ViT-B/16).
SRigL with neuron ablation was shown to achieve generalization performance comparable to unstructured RigL and often close to the dense training baseline, even at high sparsities (e.g., 90-95%) across various architectures. Extended training further improved performance, similar to RigL.
The neuron ablation component was critical. Without it, SRigL’s performance lagged behind unstructured RigL at very high sparsities (>90%) and with Vision Transformers. Enabling SRigL to ablate neurons restored performance to RigL levels. The percentage of active neurons (not ablated) learned by SRigL dynamically adapted with sparsity, mirroring RigL’s behavior. For Vision Transformers, SRigL’s performance was particularly sensitive to the ablation threshold $\gamma_{sal}$, with higher thresholds performing best, suggesting that aggressively ablating neurons to maintain sufficient density in the remaining ones is beneficial for ViTs.
The structured sparsity learned by SRigL (constant fan-in + ablated neurons) translates into tangible inference speedups. The paper demonstrates a “condensed” matrix multiplication method (Algorithm 1) that leverages this structure.
These speedups are achieved even with a straightforward PyTorch implementation, highlighting the practical benefits of the learned structure.
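The core idea of the condensed representation can be sketched in a few lines of PyTorch: each output neuron stores only its `fan_in` non-zero weights together with their input indices, and the forward pass gathers exactly those inputs. This is a simplified illustration of the idea, not the paper’s Algorithm 1, and it ignores neuron ablation and batching/layout optimizations.

```python
import torch

def condense(weight: torch.Tensor, mask: torch.Tensor):
    """Pack a constant fan-in sparse layer into dense (out, fan_in) arrays:
    the non-zero weights and the input index of each non-zero weight."""
    fan_in = int(mask[0].sum())
    idx = mask.nonzero(as_tuple=False)[:, 1].view(weight.shape[0], fan_in)
    w_cond = torch.gather(weight, 1, idx)
    return w_cond, idx

def condensed_linear(x: torch.Tensor, w_cond: torch.Tensor, idx: torch.Tensor):
    """Computes y[b, o] = sum_k x[b, idx[o, k]] * w_cond[o, k]."""
    gathered = x[:, idx]                    # (batch, out, fan_in): only the used inputs
    return (gathered * w_cond).sum(dim=-1)  # (batch, out)

# Tiny correctness check against the equivalent dense (masked) layer.
out_f, in_f, fan_in = 64, 256, 26           # ~90% sparse
w = torch.randn(out_f, in_f)
mask = torch.zeros_like(w).scatter_(1, torch.topk(w.abs(), fan_in, dim=1).indices, 1.0)
w_cond, idx = condense(w, mask)
x = torch.randn(8, in_f)
assert torch.allclose(x @ (w * mask).T, condensed_linear(x, w_cond, idx), atol=1e-5)
```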
SRigL’s structured sparsity is not just about speed; it also opens up new avenues for applying neural networks. Learning a combination of fine-grained constant fan-in sparsity and neuron-level structure makes applications feasible that would be impractical with dense models.
One interesting use case is extreme classification, where the number of classes can reach millions. Representing such a large output space with a dense model is impractical: with 1 million classes, the final classification layer alone requires a weight matrix with a million output columns, which even at modest hidden dimensions amounts to hundreds of millions or billions of weights and is both computationally expensive and memory-intensive. SRigL has already been successfully applied to extreme classification in our follow-up NeurIPS 2024 work “Navigating Extremes: Dynamic Sparsity in Large Output Spaces”, a collaboration with Aalto University and the University of Bath.
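As a rough back-of-the-envelope comparison (the hidden dimension and per-class fan-in below are assumed purely for illustration, not figures from the follow-up paper):

```python
# Rough parameter/memory estimate for a 1M-class output layer in fp32.
num_classes = 1_000_000
hidden_dim = 768          # assumed encoder width
fan_in = 32               # assumed constant fan-in per class under SRigL

dense_params = num_classes * hidden_dim      # 768,000,000 weights
dense_bytes = dense_params * 4               # ~3.1 GB of fp32 weights

sparse_params = num_classes * fan_in         # 32,000,000 weights
sparse_bytes = sparse_params * (4 + 4)       # fp32 weight + int32 index: ~0.26 GB

print(f"dense:  {dense_bytes / 1e9:.2f} GB")
print(f"sparse: {sparse_bytes / 1e9:.2f} GB")
```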
The same problem arises in other domains, such as natural language processing (NLP) and recommendation systems, where output spaces can likewise be extremely large; SRigL’s ability to learn structured sparsity can help represent and process these large output layers efficiently.
“Dynamic Sparse Training with Structured Sparsity” shows that dynamic sparse training can produce sparse models that are both accurate and practical to accelerate on commodity hardware. The insight that successful DST methods at high sparsity inherently learn to reduce model width (neuron ablation) is key, and SRigL formalizes it. This work underscores that much of the progress in deep learning comes from methods that better leverage hardware capabilities.
SRigL paves the way for deploying highly efficient and accurate sparse models in a wider range of applications, making powerful AI more accessible and sustainable.
If you find this work useful, please consider citing it using the following BibTeX entry:
@inproceedings{lasby2024srigl,
author = {Lasby, Mike and Golubeva, Anna and Evci, Utku and Nica, Mihai and Ioannou, Yani},
booktitle = {International Conference on Learning Representations (ICLR)},
venue = {Vienna, Austria},
eventdate = {2024-05-07/2024-05-11},
title = {Dynamic Sparse Training with Structured Sparsity},
year = {2024},
arxivid = {2305.02299},
eprint = {2305.02299},
eprinttype = {arXiv}
}
"Dynamic Sparse Training with Structured Sparsity" (https://t.co/MCxVCeMYt0) was accepted at ICLR 2024! DST methods learn state-of-the-art sparse masks, but accelerating DNNs with unstructured masks is difficult. SRigL learns structured masks, improving real-world CPU/GPU timings pic.twitter.com/zZASJlXtRi
— Mike Lasby (@mikelasby) January 23, 2024