Gradient Flow in Sparse Neural Networks & Why Lottery Tickets Win

An exploration of why sparse neural networks are hard to train and how understanding gradient flow sheds light on Lottery Tickets and Dynamic Sparse Training.

TL;DR

Training sparse neural networks directly from a random initialization is notoriously difficult, often resulting in poor performance compared to their dense counterparts. The paper “Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win”, accepted for an Oral Presentation at AAAI 2022, investigates this through the perspective of gradient flow and finds:

  1. Standard dense initializations cause poor gradient flow in sparse networks at the start of training; a sparsity-aware initialization fixes this.
  2. Dynamic Sparse Training methods improve gradient flow during training, which correlates with better generalization.
  3. Lottery Tickets do not fix gradient flow; they win by re-learning the pruning solution they were derived from.

The Quest for Efficient Yet Powerful Neural Networks

Deep Neural Networks (DNNs) are the powerhouses behind many AI breakthroughs. However, their increasing size and computational appetite pose significant challenges for deployment and training sustainability. One promising avenue for efficiency is sparsity: using networks with far fewer connections (and thus parameters) than typical dense networks.

A common way to obtain a sparse network is by pruning a large, trained dense network. This often yields sparse models that retain the performance of the original dense model with a fraction of the parameters.

Training Outcomes: Pruning pipeline vs. sparse training problem.
Figure 1: (Left) The standard pruning pipeline: train a dense model, prune it, and optionally fine-tune to get a good sparse model. (Right) The sparse training problem: initializing a sparse network randomly and training it often leads to poor performance compared to the pruned model.

However, what if we want to train a sparse network from the get-go, without the costly pre-training of a dense model? This is where things get tricky. Naively initializing a network with a sparse structure and training it from scratch (the “sparse training problem”) usually leads to significantly worse performance. This raises the question: why is training sparse networks so hard, and what can we learn from the exceptions?

The Importance of Gradient Flow

Many advancements in training dense DNNs have come from understanding and improving gradient flow – how the error signals propagate backward through the network to update the weights. Poor gradient flow can lead to vanishing or exploding gradients, making training stall or become unstable. This paper applies this lens to sparse neural networks.

Problem 1: Off to a Bad Start — Poor Gradient Flow at Initialization

Dense Initialization and Sparse Fan-in

Dense fan-in in a neural network.
(a) Dense layer, every neuron has the same number of incoming connections
Heterogeneous sparse fan-in in a neural network.
(b) Sparse layer, every neuron can have a different number of incoming connections
Figure: (a) In a dense layer, each neuron has the same fan-in. (b) In a general unstructured sparse layer, however, the fan-in can vary significantly from neuron to neuron — we propose an initialization that accounts for this.

Standard weight initialization methods like Glorot/He are designed with dense networks in mind. They assume that all neurons in a layer have the same number of incoming (fan-in) and outgoing (fan-out) connections.

For example, for a layer using a ReLU activation function, the weights are initialized from a distribution with variance inversely proportional to the number of incoming connections (fan-in):

\[w_{ij}^{[l]} \sim \mathcal{N}(0, \frac{2}{\text{fan-in}}),\]

where $\text{fan-in}$ is the number of incoming connections for the layer, and $w_{ij}^{[l]}$ is the weight connecting neuron $j$ in layer $l-1$ to neuron $i$ in layer $l$. This ensures that the variance of the output of the layer is roughly equal to the variance of its input, which helps maintain a healthy signal flow through the network.
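
For concreteness, here is a minimal NumPy sketch of this standard (dense) fan-in initialization for a single fully-connected layer; the layer sizes are just illustrative:

```python
import numpy as np

def he_init_dense(fan_in, fan_out, rng=None):
    """Standard He initialization: every weight is drawn with variance 2 / fan-in."""
    rng = rng or np.random.default_rng(0)
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(loc=0.0, scale=std, size=(fan_out, fan_in))

W = he_init_dense(fan_in=784, fan_out=300)  # e.g. a 784 -> 300 ReLU layer
```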

A Sparsity-Aware Initialization to Fix Gradient Flow at Initialization

Sparse neural network with dense init.
(a) Dense initialization assumes every neuron has the same number of connections and, on average, uses weights that are too small
Sparse neural network with sparse init.
(b) Sparse initialization calculates the correct weight variance for each neuron based on its number of incoming connections
Figure: (a) Using a dense initialization for a sparse layer causes vanishing gradients, as neurons with few connections are initialized incorrectly; (b) the sparse initialization accounts for the fact that fan-in can vary significantly from neuron to neuron — ensuring better-behaved gradients at initialization.

In a sparse neural network, the assumption that every neuron has the same number of connections breaks down. Rather, the number of connections per neuron can be highly variable. Using dense initializations directly in sparse networks often causes the signal to vanish rapidly as it propagates through the layers. This is because neurons with fewer incoming connections are initialized with weights that are too small, leading to a weak signal. This can cause the gradients to vanish, especially in deeper networks, making it hard for the model to learn effectively.

Our work proposes a sparsity-aware initialization that adjusts the variance of the initial weights for each neuron based on its actual fan-in within the sparse structure. For example, for a layer with a ReLU activation function:

\[w_{ij}^{[l]} \sim \mathcal{N}(0, \frac{2}{\text{fan-in}_i^{[l]}}),\]

where $\text{fan-in}_i^{[l]}$ is the number of incoming connections for neuron $i$ in layer $l$.
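
A minimal sketch of this sparsity-aware initialization, assuming an unstructured binary mask of shape (fan_out, fan_in) where a 1 marks a kept connection (the function and variable names are ours, not the paper's code):

```python
import numpy as np

def sparse_he_init(mask, rng=None):
    """Per-neuron He initialization: the weights of output neuron i use variance
    2 / fan_in_i, where fan_in_i is the number of *kept* incoming connections."""
    rng = rng or np.random.default_rng(0)
    mask = np.asarray(mask, dtype=bool)               # shape: (fan_out, fan_in)
    fan_in_per_neuron = mask.sum(axis=1)              # incoming connections per output neuron
    # Guard against fully pruned neurons; their weights are zeroed by the mask anyway.
    std = np.sqrt(2.0 / np.maximum(fan_in_per_neuron, 1))
    weights = rng.normal(size=mask.shape) * std[:, None]
    return weights * mask                             # zero out pruned connections

# Example: a random 80%-sparse mask for a 784 -> 300 layer.
rng = np.random.default_rng(0)
mask = rng.random((300, 784)) < 0.2
W = sparse_he_init(mask, rng)
```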

Signal Propagation at Initialization with Different Initializations

Signal Propagation at Initialization: Graph showing standard deviation of output vs. sparsity.
Figure 3: Standard deviation of the pre-softmax output ($\sigma(f(x))$) in LeNet-5 vs. sparsity level. Dense initialization (blue) shows signal vanishing with increasing sparsity. Sparsity-aware initializations (Liu et al. and "Ours" - the paper's proposal) maintain signal strength.

This sparsity-aware initialization leads to better signal propagation at the start of training and can improve the final generalization performance, especially for networks without normalization layers like BatchNorm (e.g., LeNet-5, VGG-16). For models with BatchNorm (e.g., ResNet-50), the effect of initialization is less pronounced, as BatchNorm itself helps regulate signal propagation.
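
To see the effect summarized in Figure 3 qualitatively, the following sketch (not the paper's exact setup; the network here is a toy randomly masked ReLU MLP) measures the standard deviation of the output at initialization under the two schemes:

```python
import numpy as np

def output_std(sparsity, depth=5, width=300, dim_in=784, batch=256, sparsity_aware=True, seed=0):
    """Std of the output of a randomly masked ReLU MLP at initialization (illustrative only)."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, dim_in))
    fan_in = dim_in
    for _ in range(depth):
        mask = rng.random((width, fan_in)) >= sparsity   # keep each weight with prob. 1 - sparsity
        if sparsity_aware:
            std = np.sqrt(2.0 / np.maximum(mask.sum(axis=1), 1))[:, None]  # per-neuron fan-in
        else:
            std = np.sqrt(2.0 / fan_in)                  # dense init, ignores the mask
        W = rng.normal(size=mask.shape) * std * mask
        x = np.maximum(x @ W.T, 0.0)                     # ReLU
        fan_in = width
    return float(x.std())

for s in (0.0, 0.5, 0.9, 0.99):
    print(f"sparsity={s:.2f}  dense-init std={output_std(s, sparsity_aware=False):.4f}  "
          f"sparse-init std={output_std(s):.4f}")
```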

Problem 2: Slogging Through — Poor Gradient Flow During Training

While a good initialization helps, it’s not the whole story. Sparse networks can still suffer from poor gradient flow during the training process.

Gradient Norm During Training: Graphs for LeNet-5, VGG-16, and ResNet-50.
Figure 4: Gradient norm during training for LeNet-5 (left), VGG-16 (center), and ResNet-50 (right) under different setups. 'Scratch' (training with a fixed sparse mask from a random dense initialization) often shows very low gradient norm initially. 'Scratch+' (the same setup with the sparsity-aware initialization) improves this. 'RigL+' (a DST method combined with the sparsity-aware initialization) often shows stronger gradient flow.

This is where Dynamic Sparse Training (DST) methods come in. DST techniques, like RigL, don’t keep the sparse connectivity fixed. Instead, they periodically update the mask during training:

  1. Prune: Remove connections that have become less salient (e.g., small magnitude weights).
  2. Grow: Add new connections, often by identifying those that would have the largest gradient magnitude if they were active (a sketch of one such update follows the list).
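
Below is a minimal NumPy sketch of one such connectivity update in the spirit of RigL (drop the weakest active weights, grow the inactive connections with the largest gradient magnitude). The real method schedules the update fraction over training and operates per layer, so treat this as an illustration rather than the reference implementation:

```python
import numpy as np

def rigl_mask_update(weights, grads, mask, update_frac=0.1):
    """One prune/grow step in the spirit of RigL (illustrative sketch).

    weights, grads, mask: arrays of the same shape; mask entries are 0/1 (1 = active).
    update_frac: fraction of currently active connections to replace.
    """
    active = mask.astype(bool).reshape(-1)
    w_mag = np.abs(weights).reshape(-1)
    g_mag = np.abs(grads).reshape(-1)
    n_update = int(update_frac * active.sum())
    if n_update == 0:
        return mask

    # Prune: drop the active connections with the smallest weight magnitude.
    drop_idx = np.argsort(np.where(active, w_mag, np.inf))[:n_update]
    new_active = active.copy()
    new_active[drop_idx] = False

    # Grow: activate the inactive connections with the largest gradient magnitude
    # (RigL initializes newly grown weights to zero).
    grow_idx = np.argsort(np.where(active, -np.inf, g_mag))[-n_update:]
    new_active[grow_idx] = True

    return new_active.reshape(mask.shape).astype(mask.dtype)
```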

The paper shows that DST methods, particularly RigL, significantly improve gradient flow during training compared to training with a fixed sparse mask. These updates can introduce new directions for optimization (e.g., by creating new negative eigenvalues in the Hessian), helping the network escape poor regions of the loss landscape. This improved gradient flow correlates with better generalization performance.

The Curious Case of Lottery Tickets (LTs)

Lottery Ticket Hypothesis Concept: Diagram illustrating the LTH process.
Figure 5: The Lottery Ticket Hypothesis: A dense network is trained (obtaining a dense solution), then pruned. The "winning ticket" uses the *initial weights* ($\Theta_{t=0}$ or an early snapshot $\Theta_{0<t \ll T}$) corresponding to the pruned mask and is then trained.

The Lottery Ticket Hypothesis (LTH) proposed that within a large, randomly initialized dense network, there exist smaller subnetworks (the “winning tickets”). If these winning tickets are trained in isolation, starting from their original initialization (or from weights taken very early in the dense model’s training, known as “late rewinding”), they can achieve accuracy comparable to the full dense network.

This was exciting because it suggested a way to find highly sparse, trainable networks. However, the paper finds something intriguing: Lottery Tickets also exhibit poor gradient flow, similar to naively trained sparse networks! (see Figure 4).

So, if LTs don’t fix the gradient flow problem, why do they work so well? The paper’s central argument is that LTs succeed because they essentially re-learn the pruning solution they were derived from.

Evidence for LTs Re-learning the Pruning Solution

MDS Plot of Solutions: 2D projection of solution distances for LeNet5.
Figure 6: A 2D MDS projection showing the relative distances between different solutions for LeNet5. 'Lottery-start' is closer to 'Prune-end' than 'Scratch-start'. 'Lottery-end' converges very close to 'Prune-end', while 'Scratch-end' solutions are more dispersed and further away.
  1. Proximity in Weight Space: LT initializations (the specific weight values rewound from early in dense training) start much closer in L2 distance to the final pruned solution (the weights of the dense model after pruning) than a random “scratch” initialization using the same mask. After training, the LT solution ends up significantly closer to this pruned solution.
Loss Interpolation: Graph showing training loss along interpolation paths for LeNet5.
Figure 7: Training loss along a linear interpolation path between a starting point ($\alpha=0$, e.g., Lottery-start or Scratch-start) and the Pruned Solution ($\alpha=1$) for LeNet5. The path between 'Lottery End' and 'Pruned Solution' is relatively flat, indicating they are in the same basin. The path from 'Scratch End' often shows a barrier.
  2. Same Basin of Attraction: By interpolating between the LT solution/initialization and the pruned solution, the paper shows that they lie within the same low-loss basin of attraction. In contrast, scratch solutions often have a high loss barrier separating them from the pruned solution’s basin.
Table 1: Ensemble & Prediction Disagreement. We compare the function disagreement with the original pruning solution and ensemble generalization over 5 sparse ResNet-50 models, trained from random initializations and LTs (lottery tickets) on ImageNet. As a baseline, we also show results for 5 pruned models trained from different random initializations. *We compare 4 different pruned models with the pruning solution the LTs are derived from.
| Initialization | Test Accuracy (Top-1) | Ensemble | Disagreement | Disagreement w/ Pruned |
| --- | --- | --- | --- | --- |
| LT | 75.73 ± 0.08 | 76.27 | 0.0894 ± 0.0009 | 0.0941 ± 0.0009 |
| Scratch | 71.16 ± 0.13 | 74.05 | 0.2039 ± 0.0013 | 0.2033 ± 0.0012 |
| Pruned Soln. | 75.60 | – | – | – |
| 5 Diff. Pruned | 75.65 ± 0.13 | 77.80 | 0.1620 ± 0.0008 | 0.1623 ± 0.0011* |
  3. Functional Similarity: LT solutions are not only close in weight space but also learn very similar functions to the pruned solution they originated from. This is measured by a low “disagreement” (fraction of test images classified differently) between the LT solution and the pruned solution. Ensembles of LTs derived from the same pruning process show little performance gain, further suggesting they converge to nearly identical functions (a sketch of both measurements follows below).
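
For reference, both measurements above are easy to reproduce in a few lines. Below is a minimal, framework-agnostic sketch of the loss along the linear path between two weight vectors (the check behind Figure 7) and of the prediction disagreement used in Table 1; the toy quadratic loss at the end is only a stand-in for a real model's training loss:

```python
import numpy as np

def loss_along_path(theta_a, theta_b, loss_fn, n_points=21):
    """Evaluate loss_fn on evenly spaced points of the segment between two parameter vectors.
    A flat, low-loss path suggests a shared basin; a bump indicates a loss barrier."""
    alphas = np.linspace(0.0, 1.0, n_points)
    losses = np.array([loss_fn((1.0 - a) * theta_a + a * theta_b) for a in alphas])
    return alphas, losses

def disagreement(preds_a, preds_b):
    """Fraction of test examples on which two models predict different classes."""
    return float(np.mean(np.asarray(preds_a) != np.asarray(preds_b)))

# Toy usage: a quadratic loss centred at 1, with two nearby "solutions".
theta_lt, theta_pruned = np.array([0.9, 1.1]), np.array([1.05, 0.95])
alphas, losses = loss_along_path(theta_lt, theta_pruned, lambda t: float(np.sum((t - 1.0) ** 2)))
```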

The Lottery Ticket Initialization: A Nudge in the Right Direction

Loss Landscape Intuition: Diagram illustrating basins of attraction for LT and Scratch.
Figure 8: An intuitive illustration. A Lottery Ticket initialization (blue circle) is already positioned within the basin of attraction of the good Pruning Solution (green circle). Random (Scratch) initializations (red circles) are more likely to fall into different, potentially suboptimal, basins.

The implication is powerful: LTs aren’t discovering new, highly effective sparse configurations through superior optimization dynamics. Instead, their specific initialization “nudges” the optimization process to rediscover a known good solution – the one found by pruning the dense network.

Key Insights Summarized

This investigation into gradient flow in sparse neural networks reveals:

  1. Sparsity-Aware Initialization Matters: Naive use of dense initializations harms sparse networks by causing poor gradient flow from the start. Using initializations that account for the actual sparse connectivity is crucial.
  2. Dynamic Sparse Training Boosts Gradient Flow: DST methods improve gradient flow during training by adapting the network’s sparse connections, leading to better generalization than training with fixed sparse masks.
  3. Lottery Tickets are “Echoes” of Pruning: LTs work well not because they inherently possess better gradient flow, but because their specific initial weights guide them to re-learn the solution of the pruned dense model they originated from. This limits their ability to find truly novel solutions in the sparse regime.

Conclusion and Future Directions

Understanding gradient flow provides valuable insights into the challenges of training sparse neural networks. While sparsity-aware initializations and Dynamic Sparse Training offer promising avenues for improving how we train sparse models from scratch, the success of Lottery Tickets seems more about “remembering” a good solution than fundamentally solving the optimization difficulties in sparse landscapes.

The journey towards efficiently training sparse neural networks that are as performant as their dense counterparts, without relying on dense pre-training or specific “winning ticket” initializations, continues. Methods that can robustly navigate the complex loss landscapes of sparse models and maintain healthy gradient flow are key to unlocking the full potential of sparse training.

Citing our work

If you find this work useful, please consider citing it using the following BibTeX entry:

@inproceedings{evci2022gradientflowsparse,
  author = {Evci, Utku and Ioannou, Yani A. and Keskin, Cem and Dauphin, Yann},
  title = {Gradient Flow in Sparse Neural Networks and How Lottery Tickets Win},
  year = {2022},
  booktitle = {Proceedings of the 36th AAAI Conference on Artificial Intelligence},
  venue = {Vancouver, BC, Canada},
  eventdate = {2022-02-22/2022-03-01},
  month = feb,
  arxivid = {2010.03533},
  eprint = {2010.03533},
  eprinttype = {arXiv},
  doi = {10.1609/aaai.v36i6.20611}
}