Dynamic Sparse Training with Structured Sparsity

Learning Performant and Efficient Representations suitable for Hardware Acceleration

TL;DR

The challenge in training sparse neural networks is to achieve both high accuracy and practical hardware acceleration. Unstructured sparsity often yields good performance but is hard to speed up, while traditional structured sparsity can hurt performance.

Our International Conference on Learning Representations (ICLR) 2024 paper, “Dynamic Sparse Training with Structured Sparsity”, introduces Structured RigL (SRigL), which addresses this by dynamically learning hardware-friendly sparse weight representations without sacrificing accuracy. Key findings of the work include:

  • SRigL matches the generalization of unstructured RigL and comes close to dense baselines, even at high sparsities (90-95%), across a range of architectures.
  • SRigL learns a combination of fine-grained constant fan-in sparsity within active neurons and coarser neuron-level structured sparsity (neuron ablation).
  • The learned structure yields real inference speedups: at 90% sparsity, our condensed representation is 1.7$\times$ faster than dense and 13.0$\times$ faster than unstructured (CSR) sparse layers in batched GPU inference.

The Quest for More Efficient Neural Networks

State-of-the-art deep neural networks (DNNs) have achieved remarkable feats, but their ever-increasing size brings ballooning training costs, often outstripping Moore’s Law. This trend makes cutting-edge AI research less accessible. While techniques exist to prune trained dense models, effectively reducing their parameter count by 85-95% without sacrificing generalization, they rely on dense pre-training. Can we train these sparse, efficient networks without dense training?

Why Sparse Neural Networks?

Sparse neural networks offer several compelling advantages:
  • For a fixed number of weights, they can offer better generalization and fewer Floating Point Operations (FLOPs) at inference.
  • They hold the potential to significantly reduce the computational cost of training.
  • They provide a way to learn the inherent structure of neural networks, moving beyond a "one-size-fits-all" dense architecture.

Paths to Sparsity: Pruning, Lottery Tickets, and Dynamic Sparse Training

Several approaches have been developed to tackle the challenge of obtaining sparse networks; these include:

Traditional Pruning: Effective but Costly

Figure: Pruning Dense Trained Models: Pruning a trained model is highly effective, but requires an expensive dense pre-trained model.

Standard pruning techniques involve training a full, dense network and then removing (pruning) weights deemed less important, typically those with the smallest magnitude. This can be done once (“one-shot pruning”) or iteratively. While effective at finding highly sparse subnetworks that retain accuracy, this still necessitates the expensive initial dense training.
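For concreteness, here is a minimal sketch of one-shot global magnitude pruning in PyTorch; the helper below is illustrative rather than code from any particular pruning library.

```python
import torch

def magnitude_prune(model: torch.nn.Module, sparsity: float = 0.9) -> None:
    """One-shot global magnitude pruning: zero the smallest-magnitude weights in-place."""
    weights = [p for p in model.parameters() if p.dim() > 1]  # skip biases and norm params
    all_mags = torch.cat([w.detach().abs().flatten() for w in weights])
    k = int(sparsity * all_mags.numel())
    threshold = all_mags.kthvalue(k).values  # global magnitude cutoff
    with torch.no_grad():
        for w in weights:
            w.mul_((w.abs() > threshold).to(w.dtype))  # keep only weights above the cutoff
```

In practice the pruned model is then fine-tuned, and the prune/fine-tune cycle may be repeated for iterative pruning.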

The Sparse Training Problem: Training with a Fixed Sparse Mask

Figure: Sparse Training Problem: Training a sparse mask from random initialization doesn't recover the generalization of a pruned dense model, even for a known good mask.

Despite the success of pruning, simply training a sparse network from a random initialization, even with a known “good” sparse mask (the pattern of zeroed-out weights), often leads to poor performance compared to its dense counterpart or a pruned dense model. This is known as the sparse training problem.

The Lottery Ticket Hypothesis (LTH) proposed that within a large, randomly initialized dense network, there exist smaller subnetworks (the “winning tickets”) that, when trained in isolation from their original initialization (or weights from very early in training), can achieve accuracy comparable to the full dense network. This was a foundational idea, suggesting that specific initializations within a sparse structure are crucial. However, finding these “winning tickets” is computationally expensive, especially for larger models, and the original findings were primarily on smaller datasets. Research further showed that these Lottery Tickets often end up re-learning the solution that would have been found by pruning the dense model they originated from.

Dynamic Sparse Training (DST): Training Sparse from the Start

Figure: Dynamic Sparse Training (DST) methods are an alternative approach to sparse training in which the mask changes over the course of training. DST methods can match the generalization performance of standard dense training.

Dynamic Sparse Training (DST) methods offer a more direct approach to training sparse networks. Techniques like Sparse Evolutionary Training (SET) and Rigging the Lottery (RigL) train networks that are sparse from initialization to the final solution (“sparse-to-sparse”). They achieve this by dynamically changing the sparse connectivity during training: periodically pruning less salient connections (e.g., small-magnitude weights) and growing new ones (e.g., where the gradient magnitude is large). DST can achieve generalization comparable to dense training at high sparsity levels.
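As a rough illustration of such a prune-and-grow update, here is a simplified sketch for a single layer, assuming access to the layer's dense gradient; it is not the RigL authors' implementation.

```python
import torch

def prune_and_grow(weight, grad, mask, update_frac=0.3):
    """RigL-style mask update: drop the smallest-magnitude active weights and grow
    the same number of inactive connections with the largest gradient magnitude."""
    n_update = int(update_frac * mask.sum().item())

    # Prune: among active weights, find the smallest magnitudes.
    active_scores = torch.where(mask.bool(), weight.abs(), torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(active_scores.flatten(), n_update, largest=False).indices

    # Grow: among inactive positions, find the largest gradient magnitudes.
    inactive_scores = torch.where(mask.bool(), torch.full_like(grad, -float("inf")), grad.abs())
    grow_idx = torch.topk(inactive_scores.flatten(), n_update, largest=True).indices

    new_mask = mask.flatten().clone()
    new_mask[drop_idx] = 0
    new_mask[grow_idx] = 1
    new_mask = new_mask.view_as(mask)
    # Assuming pruned weights are kept at zero, newly grown connections start at zero, as in RigL.
    return new_mask, weight * new_mask
```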

Unstructured vs. Structured Sparsity

Figure: Unstructured vs. Structured sparsity. Unstructured sparsity is difficult to use efficiently on current computational hardware, such as CPUs and GPUs.

Unstructured Sparsity

A significant challenge with many DST methods like RigL is that they typically produce unstructured sparsity. This means individual weights are zeroed out irregularly across the weight matrices.

Figure: Unstructured Sparsity. A neural network with unstructured sparse weights, and the corresponding weight matrix.

Structured Sparsity (e.g., removing neurons/blocks)

Figure: Structured Sparsity. A neural network with structured sparse weights, and the corresponding weight matrix.

In contrast, structured sparsity involves removing entire blocks of weights, such as channels, filters, or even neurons.
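For intuition, removing an output neuron corresponds to zeroing an entire row of the layer's weight matrix, whereas unstructured sparsity scatters zeros irregularly. A toy comparison:

```python
import torch

out_features, in_features = 4, 8
w = torch.randn(out_features, in_features)

# Unstructured: individual weights are zeroed wherever their magnitude is small.
unstructured = w * (w.abs() > w.abs().median()).to(w.dtype)

# Structured (neuron-level): entire rows are zeroed, i.e. whole output neurons are removed.
structured = w.clone()
structured[[1, 3], :] = 0.0  # neurons 1 and 3 removed entirely
```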

N:M Fine-grained Structured Sparsity

Figure: N:M Structured Sparsity. A neural network with N:M structured sparse weights, and the corresponding weight matrix.

N:M fine-grained sparsity is a compromise where, within small contiguous blocks of M weights, exactly N weights are non-zero. NVIDIA’s Ampere GPUs support 2:4 sparsity, offering some acceleration.
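As a sketch of what the pattern looks like, the helper below projects a weight matrix onto N:M sparsity by keeping the N largest-magnitude weights in each contiguous block of M along the input dimension (illustrative code, not NVIDIA's sparsity tooling):

```python
import torch

def enforce_n_m(weight: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every contiguous block of m weights."""
    out_features, in_features = weight.shape
    assert in_features % m == 0, "input dimension must be divisible by the block size"
    blocks = weight.reshape(out_features, in_features // m, m)
    keep = torch.topk(blocks.abs(), n, dim=-1).indices          # top-n positions per block
    mask = torch.zeros_like(blocks).scatter_(-1, keep, 1.0)     # ones at the kept positions
    return (blocks * mask).reshape(out_features, in_features)   # exactly n non-zeros per block
```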

The ideal scenario is to combine the high accuracy of unstructured DST with the hardware-friendliness of fine-grained structured sparsity.

Dynamic Sparse Training for Learning Structured Sparse Representations

Our work “Dynamic Sparse Training with Structured Sparsity” proposes a novel DST method called Structured RigL (SRigL) to address this challenge. SRigL aims to learn sparse networks that are both highly accurate and possess a structure amenable to real-world acceleration.

Constant Fan-in Sparsity

SRigL modifies the RigL algorithm to enforce a constant fan-in constraint. This means each neuron (or output channel in a convolutional layer) has the same number of active incoming connections. This is a specific type of N:M sparsity (where N is the fan-in and M is the potential dense fan-in) and results in a regular structure within the weight matrices. Theoretical analysis suggests that this constant fan-in constraint should not inherently impair training dynamics and might even offer slightly better output-norm variance compared to less constrained sparsity patterns, especially for very sparse networks.
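To make the constraint concrete, the sketch below builds a constant fan-in mask by keeping the same number of largest-magnitude weights in every row of a weight matrix (an illustration, not the paper's implementation):

```python
import torch

def constant_fan_in_mask(weight: torch.Tensor, fan_in: int) -> torch.Tensor:
    """Binary mask keeping exactly `fan_in` weights per output neuron (per row)."""
    keep = torch.topk(weight.abs(), fan_in, dim=1).indices  # largest-magnitude columns per row
    mask = torch.zeros_like(weight)
    mask.scatter_(1, keep, 1.0)
    return mask

w = torch.randn(64, 256)
mask = constant_fan_in_mask(w, fan_in=26)     # roughly 90% sparse
assert (mask.sum(dim=1) == 26).all()          # every neuron has the same fan-in
```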

The Hidden Trick of DST: Neuron Ablation

Figure: Neuron Ablation in DST. When a neuron has too few weights to learn useful representations, DST methods learn to ablate, or remove, the entire neuron, effectively reducing the width of the layer.

A key empirical finding was that standard unstructured DST methods like RigL, when pushed to very high sparsity levels (>90%), implicitly learn to ablate neurons: they remove neurons left with very few weights, effectively reducing the width of layers.

This neuron ablation appears crucial for maintaining generalization at extreme sparsities, but enforcing a naive constant fan-in constraint would prevent this, as it would force every neuron to maintain the same number of weights, even if those weights are not useful for learning.

The SRigL Algorithm

Figure: Constant Fan-in Fine-grained Sparsity. A form of N:M sparsity where every neuron in a layer is constrained to have the same number of weights.

SRigL integrates the constant fan-in objective with an explicit neuron ablation mechanism. The core steps, adapted from RigL, are:

  1. Identify weights to prune (smallest magnitude) and potential connections to grow (largest gradient magnitude on zeroed weights).
  2. Count salient weights per neuron.
  3. Ablate neurons: If a neuron has fewer salient weights than a defined threshold ($\gamma_{sal}$ multiplied by the target fan-in), it’s entirely pruned. Its designated weights are redistributed.
  4. Compute the new constant fan-in based on any ablated neurons.
  5. Prune the globally smallest magnitude weights.
  6. For each active neuron, regrow connections to meet the target constant fan-in, prioritizing those with the largest gradient magnitudes.

This allows SRigL to learn both fine-grained constant fan-in sparsity within active neurons and coarser neuron-level structured sparsity.
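A simplified per-layer sketch of how these steps might fit together is shown below. It is illustrative rather than the authors' released code; `target_fan_in`, `gamma_sal`, and `update_frac` are stand-ins for the corresponding quantities in the paper.

```python
import torch

def srigl_step(weight, grad, mask, target_fan_in, gamma_sal, update_frac=0.3):
    """Simplified SRigL-style mask update for one [out_features, in_features] layer."""
    in_features = weight.shape[1]
    n_update = int(update_frac * target_fan_in)

    # 1. Score active weights by magnitude and inactive positions by gradient magnitude.
    mag = torch.where(mask.bool(), weight.abs(), torch.full_like(weight, -1.0))
    grad_mag = torch.where(mask.bool(), torch.full_like(grad, -1.0), grad.abs())

    # 2. Count salient weights per neuron: active weights that would survive pruning
    #    plus that neuron's strongest grow candidates.
    keep_cand = torch.topk(mag, max(target_fan_in - n_update, 0), dim=1).values
    grow_cand = torch.topk(grad_mag, n_update, dim=1).values
    salient = (keep_cand > 0).sum(dim=1) + (grow_cand > 0).sum(dim=1)

    # 3. Ablate neurons with too few salient weights.
    active = salient >= gamma_sal * target_fan_in

    # 4. Redistribute the ablated neurons' weight budget as a new constant fan-in.
    budget = int(mask.sum().item())
    new_fan_in = min(budget // max(int(active.sum()), 1), in_features)

    # 5./6. For each remaining neuron, keep its largest-magnitude weights and regrow
    #       at its largest-gradient positions up to the new constant fan-in.
    n_keep = max(new_fan_in - n_update, 0)
    n_grow = new_fan_in - n_keep
    new_mask = torch.zeros_like(mask)
    for i in torch.nonzero(active).flatten():
        new_mask[i, torch.topk(mag[i], n_keep).indices] = 1
        new_mask[i, torch.topk(grad_mag[i], n_grow).indices] = 1
    return new_mask, active
```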

Key Results: Performance and Acceleration

SRigL was evaluated on image classification tasks using CIFAR-10 (ResNet-18, Wide ResNet-22) and ImageNet (ResNet-50, MobileNet-V3, ViT-B/16).

Matching Dense Accuracy with Structured Sparsity

Figure: (a) Percentage active neurons (i.e., not ablated) following RigL/SRigL training on ResNet-50/ImageNet (b) Sparse Fan-In vs. ViT layer index at the end of training with RigL at 90% sparsity.

SRigL with neuron ablation was shown to achieve generalization performance comparable to unstructured RigL and often close to the dense training baseline, even at high sparsities (e.g., 90-95%) across various architectures. Extended training further improved performance, similar to RigL.

The Importance of Neuron Ablation

Figure: (a) Percentage active neurons (i.e., not ablated) following RigL/SRigL training on ResNet-50/ImageNet (b) Sparse Fan-in vs. ViT layer index at the end of training with RigL at 90% sparsity.

The neuron ablation component was critical. Without it, SRigL’s performance lagged behind unstructured RigL at very high sparsities (>90%) and with Vision Transformers. Enabling SRigL to ablate neurons restored performance to RigL levels. The percentage of active neurons (not ablated) learned by SRigL dynamically adapted with sparsity, mirroring RigL’s behavior. For Vision Transformers, SRigL’s performance was particularly sensitive to the ablation threshold $\gamma_{sal}$, with higher thresholds performing best, suggesting that aggressively ablating neurons to maintain sufficient density in the remaining ones is beneficial for ViTs.

Real-World Speedups

Figure: (a) Percentage active neurons (i.e., not ablated) following RigL/SRigL training on ResNet-50/ImageNet (b) Batched GPU inference with batch size of 2048 on an NVIDIA Titan V. At 90% sparsity, our condensed representation is 1.7$\times$ faster than dense and 13.0$\times$ faster than unstructured (CSR) sparse layers. Note the y-axis is log-scaled.

The structured sparsity learned by SRigL (constant fan-in + ablated neurons) translates into tangible inference speedups. The paper demonstrates a “condensed” matrix multiplication method (Algorithm 1 in the paper) that leverages this structure.

These speedups are achieved even with a straightforward PyTorch implementation, highlighting the practical benefits of the learned structure.
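The core idea of the condensed representation can be sketched with a gather-based PyTorch layer (an illustration of the approach rather than Algorithm 1 verbatim): each output neuron stores only its `fan_in` non-zero weight values together with the input indices they connect to.

```python
import torch

class CondensedLinear(torch.nn.Module):
    """Linear layer with constant fan-in sparsity stored in condensed form:
    each output neuron keeps `fan_in` weight values and their input indices."""

    def __init__(self, values: torch.Tensor, indices: torch.Tensor):
        super().__init__()
        self.values = torch.nn.Parameter(values)   # [out_features, fan_in] non-zero weights
        self.register_buffer("indices", indices)   # [out_features, fan_in] input positions

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gathered = x[:, self.indices]                # [batch, out_features, fan_in]
        return (gathered * self.values).sum(dim=-1)  # [batch, out_features]
```

Given a dense weight matrix and a constant fan-in mask learned by SRigL, `indices` holds each row's non-zero column positions and `values` the corresponding weights; ablated neurons simply do not appear as rows, so the effective layer width, and with it the FLOP count, shrinks as well.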

Not Only Efficiency: SRigL Enables New Applications of Neural Networks

SRigL’s structured sparsity is not just about speed; it also opens up new avenues for neural networks. The ability to learn a combination of fine-grained constant fan-in and neuron-level structured sparsity enables otherwise infeasible applications with neural networks.

One interesting use case is in extreme classification, where the number of classes can reach millions. Representing such a large number of classes with dense models is impractical due to the sheer size of the final classification layer: with 1 million classes, its dense weight matrix scales with the number of classes times the feature dimension, making it both computationally expensive and memory-intensive. SRigL has already been successfully applied to extreme classification in our follow-up NeurIPS 2024 work “Navigating Extremes: Dynamic Sparsity in Large Output Spaces” in collaboration with Aalto University and the University of Bath.

The same problem arises in other domains, such as natural language processing (NLP) and recommendation systems, where the number of output classes can be extremely large; SRigL’s ability to learn structured sparsity can help represent and process these large output spaces efficiently.

Conclusion and Future Horizons

“Dynamic Sparse Training with Structured Sparsity” makes a significant stride towards practical sparse neural networks. SRigL demonstrates that it’s possible to:

  • train networks that are sparse from initialization to the final solution, without dense pre-training;
  • match the generalization of unstructured DST methods like RigL, and approach dense baselines, even at high sparsity;
  • learn a structure (constant fan-in plus neuron ablation) that translates into real inference speedups on today’s hardware.

The insight that successful DST methods at high sparsity inherently learn to reduce model width (neuron ablation) is central, and SRigL formalizes it. This work underscores that much of the progress in deep learning comes from methods that better leverage hardware capabilities.

SRigL paves the way for deploying highly efficient and accurate sparse models in a wider range of applications, making powerful AI more accessible and sustainable.

Citing our work

If you find this work useful, please consider citing it using the following BibTeX entry:

@inproceedings{lasby2024srigl,
  author = {Lasby, Mike and Golubeva, Anna and Evci, Utku and Nica, Mihai and Ioannou, Yani},
  booktitle = {International Conference on Learning Representations (ICLR)},
  venue = {Vienna, Austria},
  eventdate = {2024-05-07/2024-05-11},
  title = {Dynamic Sparse Training with Structured Sparsity},
  year = {2024},
  arxivid = {2305.02299},
  eprint = {2305.02299},
  eprinttype = {arXiv}
}
