Sparse Training from Random Initialization: Aligning Lottery Ticket Masks using Weight Symmetry

An exploration of why Lottery Ticket Hypothesis masks fail on new random initializations and how understanding weight symmetry in neural networks allows us to successfully reuse them.

TL;DR

The Lottery Ticket Hypothesis (LTH) demonstrates that remarkably sparse “winning ticket” neural network models can be trained to match the performance of their dense counterparts. However, there’s a catch: a winning ticket’s sparse mask is tightly coupled to the original weight initialization used to find it. Using the same mask with any other random initialization results in a significant drop in performance — also known as the “sparse training problem”.

Our ICML 2025 paper “Sparse Training from Random Initialization: Aligning Lottery Ticket Masks using Weight Symmetry” investigates the sparse training problem from a weight-space symmetry perspective and finds:

  1. An LTH mask fails on a new random initialization because the mask is aligned to the original solution basin, while the new initialization lands in a different, permuted basin.
  2. Permuting the mask to align it with the new initialization’s basin enables successful sparse training from that initialization, closing most of the performance gap to the LTH, especially as model width increases.
  3. Sparse models trained with permuted masks are more functionally diverse than LTH retraining, which effectively relearns the original solution.

The Sparse Training Problem and the Lottery Ticket Hypothesis

The Sparse Training Problem

Diagram showing standard training and pruning with a dense model.
Figure 1(a): The standard dense training and pruning pipeline creates a good pruned solution.

Dense neural network training followed by pruning is a well-established method for obtaining smaller, efficient models for inference: we can train a dense neural network to convergence, and then prune it to obtain a sparse mask. The resulting sparse model can often match the performance of the original dense model.

Diagram showing the sparse training problem.
Figure 1(b): The sparse training problem: applying the mask from the pruned solution to a new, different random initialization results in poor performance.

If we can use a sparse neural network at test/inference time, why can’t we train the model sparse from the beginning? The sparse training problem arises when we try to reuse the sparse mask obtained from pruning to train a new model from a different random initialization. Naively applying the mask to this new initialization leads to a significant drop in performance. This phenomenon has been observed across various architectures and datasets.

The Lottery Ticket Hypothesis

Diagram showing the original Lottery Ticket Hypothesis.
Figure 1(c): The Lottery Ticket Hypothesis (LTH) proposes a solution to the sparse training problem by rewinding the weights of the remaining connections to their values from very early in training before retraining the sparse model. Unfortunately, it does not address the issue of applying the mask to a new initialization, and has been shown to effectively relearn the same solution as the original dense model.

The Lottery Ticket Hypothesis (LTH) proposes a solution to the sparse training problem. It suggests we can re-use a sparse mask by rewinding the weights of the remaining connections to their values from very early in training before retraining the sparse model. The standard LTH methodology is:

  1. Train a full, dense neural network.
  2. Prune the connections with the smallest magnitude weights to get a sparse mask.
  3. “Rewind” the weights of the remaining connections to their values from very early in training and train the sparse neural network again.

This process can produce sparse models that match the performance of the original dense one; however, in practice it requires expensive dense pre-training, and retraining from many early training checkpoints, to identify a “winning ticket”. Rather than solving the sparse training problem, i.e. re-using a mask with a new random initialization, the LTH sidesteps it by ensuring the mask is trained within the original solution basin. In fact, it has been shown that the LTH does not learn new solutions, but rather relearns the same solution as the original dense model.
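
For intuition, here is a minimal PyTorch sketch (ours, not the paper’s code) of steps 2 and 3 above: global magnitude pruning followed by rewinding the surviving weights to an early checkpoint. The toy model, sparsity level, and checkpoint handling are illustrative placeholders.

```python
import torch
import torch.nn as nn

def magnitude_prune_masks(model, sparsity):
    """Global magnitude pruning: binary masks that zero out the `sparsity`
    fraction of smallest-magnitude weights across the whole model."""
    flat = torch.cat([p.detach().abs().flatten() for p in model.parameters()])
    k = int(sparsity * flat.numel())
    threshold = flat.kthvalue(max(k, 1)).values
    return [(p.detach().abs() > threshold).float() for p in model.parameters()]

# A toy dense model and an "early" checkpoint stand in for the real network.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))
early_ckpt = {k: v.clone() for k, v in model.state_dict().items()}  # rewind point

# (1) ... dense training of `model` to convergence would happen here ...
# (2) Prune the trained dense weights to obtain the sparse mask.
masks = magnitude_prune_masks(model, sparsity=0.9)
# (3) Rewind the surviving weights to their early values and retrain sparsely,
#     re-applying the mask after every optimizer step so pruned weights stay zero.
model.load_state_dict(early_ckpt)
with torch.no_grad():
    for p, m in zip(model.parameters(), masks):
        p.mul_(m)
```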

The sparse training problem motivates the question: what if we could just use a winning ticket mask to train a sparse model from a new random initialization? Unfortunately, the LTH doesn’t solve this, but it does point the way by highlighting the importance of the coupling between the sparse mask and the original weight initialization.

It’s All About Symmetry

Diagram illustrating that swapping two neurons results in a functionally identical network.
Figure 2: Swapping two neurons (and their incoming and outgoing weights) results in a functionally identical network, illustrating permutation or weight symmetry.

The answer lies in a fundamental property of neural networks: permutation or weight symmetry. If you take a layer in a neural network model and swap two of its neurons, including their incoming and outgoing weights, the function the neural network represents remains identical. However, in the high-dimensional weight space in which we optimize, these two neural network models are at completely different locations.

This symmetry means the loss landscape is filled with many identical, mirrored loss basins. When we train a model from a random initialization, it descends into one of these basins. Recent work suggests that most solutions found by SGD lie in a single basin, once you account for these permutations.
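
To make this symmetry concrete, the following minimal PyTorch sketch (our illustration) permutes the hidden units of a small two-layer MLP, along with their incoming and outgoing weights, and checks that the network’s outputs are unchanged.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
mlp = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 3))
x = torch.randn(5, 4)
y_before = mlp(x).detach()

# Permute the 8 hidden units: reorder the rows of the first layer's weight
# and bias (incoming weights) and the columns of the second layer's weight
# (outgoing weights) with the same permutation.
perm = torch.randperm(8)
with torch.no_grad():
    mlp[0].weight.copy_(mlp[0].weight[perm])
    mlp[0].bias.copy_(mlp[0].bias[perm])
    mlp[2].weight.copy_(mlp[2].weight[:, perm])

y_after = mlp(x).detach()
print(torch.allclose(y_before, y_after, atol=1e-6))  # True: identical function
```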

The Geometry of the Sparse Training Problem

Dense Training and Pruning

Loss landscape showing dense training and weight magnitude-based pruning.
Figure 3(a): Dense neural network training and pruning. A dense neural network model of only two neurons, each with a single weight, $w_0$ and $w_1$ respectively, can illustrate the geometry of loss landscapes and the sparse training problem. Here, dense neural network training and weight magnitude-based pruning results in a performant neural network for inference with a sparse mask $\mathbf{m}_A$.

Here we show the loss landscape of a neural network model of only two neurons, each with a single weight $w_0$ and $w_1$ respectively. The model has two symmetric loss basins/local minima, corresponding to the two possible permutations of the neurons. This simple model can illustrate the geometry of loss landscapes, and convey our intuition about the sparse training problem.

In Figure 3(a) a neural network $A$ is trained from random initialization $\mathbf{w}^{t=0}_A$ to a good solution $\mathbf{w}^{t=T}_A$, and pruned to remove the smallest magnitude weight, defining a mask $\mathbf{m}_A$ and a sparse neural network model $\mathbf{w}^{t=T}_A \odot \mathbf{m}_A$. In general such dense training and pruning works well, and maintains good generalization (e.g. test accuracy).

The Lottery Ticket Hypothesis

Loss landscape showing Lottery Ticket Hypothesis methodology.
Figure 3(b): The Lottery Ticket Hypothesis (LTH) suggests that a sparse model can be trained with the mask $\mathbf{m}_A$ using the same training procedure as the original model, but with the mask applied from almost the start, achieving sparse training.

In Figure 3(b) we again train neural network $A$ from the same random initialization $\mathbf{w}^{t=0}_A$; however, in this case we train sparse, i.e. using the mask $\mathbf{m}_A$. This is equivalent to projecting our initial weights onto the subspace defined by the mask, here the single dimension $w_0$, and training within that restricted one-dimensional subspace. We still find a good solution $\mathbf{w}^{t=T}_A \odot \mathbf{m}_A$ which maintains good generalization (e.g. test accuracy).

This is what the Lottery Ticket Hypothesis (LTH) suggests: that a sparse model can be trained with the mask $\mathbf{m}_A$ using the same training procedure as the original model, but with the mask applied from almost the start, achieving sparse training.
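
In code, training in the masked subspace simply means re-applying the binary mask so the pruned weights stay at zero and optimization proceeds only along the surviving coordinates. Below is a minimal sketch (ours), with a random mask standing in for $\mathbf{m}_A$ and a toy regression task.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(4, 8), nn.ReLU(), nn.Linear(8, 1))
# A fixed binary mask per parameter tensor; a random mask stands in for m_A.
masks = [(torch.rand_like(p) > 0.5).float() for p in model.parameters()]
opt = torch.optim.SGD(model.parameters(), lr=0.1)
x, y = torch.randn(64, 4), torch.randn(64, 1)

# Project the initialization onto the masked subspace, then keep it there.
with torch.no_grad():
    for p, m in zip(model.parameters(), masks):
        p.mul_(m)

for step in range(100):
    opt.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    opt.step()
    with torch.no_grad():
        for p, m in zip(model.parameters(), masks):
            p.mul_(m)  # zero out pruned weights after each update
```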

The Sparse Training Problem

Loss landscape showing dense training and weight magnitude-based pruning.
Figure 3(c): The sparse training problem is illustrated by attempting to train model $B$ from a new random initialization, $\mathbf{w}^{t=0}_B$, while re-using the sparse mask $\mathbf{m}_A$ discovered from pruning. Sparse training of model $B$ fails with the original mask, as one of the most important weights is not preserved.

Finally, in Figure 3(c) we illustrate the sparse training problem. Here we attempt to train a new neural network $B$ from a new random initialization, $\mathbf{w}^{t=0}_B$, while re-using the sparse mask $\mathbf{m}_A$ discovered from pruning neural network $A$. Sparse training of neural network $B$ fails with the original mask: one of the most important weights is not preserved when we project using $\mathbf{m}_A$, so the projection instead lands our weights far from the solution basin, leading to poor generalization (e.g. test accuracy).

The Hypothesis: A Tale of Two Basins

Loss landscape showing dense training and weight magnitude-based pruning.
Figure 3(d): Our solution is to permute the mask to $\pi(\mathbf{m}_A)$, which aligns with Model B's basin and enables successful sparse training (green path).

This brings us to our core hypothesis: An LTH mask fails on a new initialization because the mask is aligned to one basin, while the new random initialization has landed in another.

The optimization process is essentially starting in the wrong valley for the map it’s been given. Naively applying the mask pulls the new initialization far away from a good solution path, leading to poor performance.

But what if we could “rotate” the mask to match the orientation of the new basin? This is exactly what we propose.


The Method: Aligning Masks with Permutations

Our method leverages recent advances in model merging, such as Git Re-Basin, which finds the permutation that best aligns the neurons of two separately trained models. Our training paradigm is as follows (a code sketch of steps 3 and 4 is given after the list):

  1. Train Two Dense Models: Start with two different random initializations, $\mathbf{w}_A^{t=0}$ and $\mathbf{w}_B^{t=0}$, and train them to convergence to get two dense models, $\mathbf{w}_A^{t=T}$ and $\mathbf{w}_B^{t=T}$, or Model A and Model B.
    Method Step 1: Dense training of two models.
  2. Get the LTH Mask: Prune Model A using standard iterative magnitude pruning (IMP) to get a sparse “winning ticket” mask, $\mathbf{m}_A$.
    Method Step 2: Pruning Model A to obtain the LTH mask $\mathbf{m}_A$.
  3. Find the Permutation Relating the Models: Use an activation matching algorithm to find the permutation, $\pi$, that best aligns the neurons of Model A with those of Model B, i.e. $\mathbf{w}_B^{t=T} \approx \pi(\mathbf{w}_A^{t=T})$. This essentially finds the “rotation”, or permutation, needed to map one solution basin onto the other.
    Method Step 3: Finding the permutation relating the two models.
  4. Permute the Mask: Apply the permutation $\pi$ to Model A’s mask to get a new, aligned mask for Model B: $\mathbf{m}_B = \pi(\mathbf{m}_A)$.
    Method Step 4: Permuting the mask.
  5. Train from Scratch (Almost)! Train a new sparse model starting from the $\mathbf{w}_B$ initialization (rewound to an early checkpoint, $k$), but using the permuted mask $\pi(\mathbf{m}_A)$.
    Method Step 5: Sparse training of Model B with the permuted mask.
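
As promised above, here is a minimal sketch of steps 3 and 4 for a single hidden layer (our illustration, not the paper’s released code): activation matching is posed as a linear assignment problem over a correlation cost, solved with SciPy’s `linear_sum_assignment`, and the resulting permutation is applied to the rows of Model A’s mask. The toy activations, shapes, and 90%-sparse random mask are placeholders; in a real network the permutation of the preceding layer must also be applied to the mask’s input (column) dimension.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_units(acts_a, acts_b):
    """Permutation of Model A's hidden units that best matches Model B's,
    by maximizing the correlation of their activations on the same batch.
    acts_a, acts_b: (num_examples, num_units) arrays."""
    a = (acts_a - acts_a.mean(0)) / (acts_a.std(0) + 1e-8)
    b = (acts_b - acts_b.mean(0)) / (acts_b.std(0) + 1e-8)
    similarity = b.T @ a                         # (units_B, units_A)
    _, col = linear_sum_assignment(-similarity)  # maximize total similarity
    return col                                   # col[j]: A's unit matched to B's unit j

# Toy stand-ins: Model B's hidden units are a hidden permutation of Model A's.
rng = np.random.default_rng(0)
acts_a = rng.normal(size=(256, 32))
true_perm = rng.permutation(32)
acts_b = acts_a[:, true_perm]
mask_a = (rng.random((32, 16)) > 0.9).astype(np.float32)  # rows = hidden units of A

perm = match_units(acts_a, acts_b)   # step 3: recover the permutation
mask_b = mask_a[perm]                # step 4: pi(m_A), aligned to Model B's units
assert np.array_equal(perm, true_perm)
```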

Permuted vs. the LTH and Naive Baselines

Diagram of the training paradigm, from training dense models to permutation matching and final sparse training.
Figure 4: The overall framework of our training procedure. We use two trained dense models to find a permutation $\pi$. This permutation is then applied to the mask from Model A, allowing it to be successfully used to train Model B from a random initialization.

Here, and in Figure 4, we present the three different training paradigms we compare in our results:

  1. LTH: The original Lottery Ticket Hypothesis approach, which trains a dense model, prunes it, and retrains the sparse network from the original (rewound) initialization.
  2. Naive: A straightforward application of the pruned mask from Model A to Model B without any permutation. This is the standard sparse training problem setup, and it performs poorly.
  3. Permuted: Our proposed method, which finds a permutation of the weights to better align the two models.

The Result: It Works (Approximately!)

Across a wide range of experiments, our method demonstrates that aligning the mask is the key to solving the sparse training problem for LTH masks. Of course, the permutation matching is only approximate, and so the performance doesn’t perfectly match the original LTH solution, but it comes remarkably close, especially as model width increases, which has been shown to improve permutation matching quality.

Closing the Performance Gap

Graphs showing test accuracy vs rewind points for ResNet20 on CIFAR-10 at different sparsity levels and widths.
Figure 5: Test accuracy on CIFAR-10 for ResNet20 of varying widths (`w`) and sparsities. The permuted solution (blue) consistently outperforms the naive one (orange) and gets closer to the LTH baseline (green), especially as model width increases.
Graphs showing test accuracy vs rewind points for ResNet20 on CIFAR-100 at different sparsity levels and widths.
Figure 6: Test accuracy on CIFAR-100 for ResNet20 of varying widths (`w`) and sparsities. The permuted solution (blue) consistently outperforms the naive one (orange) and gets closer to the LTH baseline (green), especially as model width increases.
Graphs showing test accuracy vs rewind points for VGG-11 on CIFAR-10 at different sparsity levels and widths.
Figure 7: Test accuracy on CIFAR-10 for VGG-11 of varying sparsities. The permuted solution (blue) consistently outperforms the naive one (orange) and gets closer to the LTH baseline (green), especially as model width increases.

When we compare the performance of the standard LTH solution, the Naive solution (un-permuted mask on a new init), and our Permuted solution, the results are clear. The Permuted approach consistently and significantly outperforms the Naive baseline, closing most of the performance gap to the original LTH solution. The effect is especially pronounced for wider models and higher sparsity levels, where the Naive method struggles most.

Unlocking Diverse Solutions with a Single Mask

Table 1: Ensemble Diversity Metrics for CIFAR-10/CIFAR-100: Although the mean test accuracy of the LTH models is higher, the ensemble of permuted models achieves better test accuracy due to the greater functional diversity of the permuted models. Here we compare several measurements of function-space similarity between the models, including disagreement, which measures prediction differences, and the Kullback–Leibler (KL) and Jensen–Shannon (JS) divergences, which quantify how much the output distributions of different models differ. As shown, the permuted masks achieve similar diversity to the computationally expensive IMP solutions, also resulting in ensembles with a similar increase in generalization.
| Mask | Test Accuracy (%) | Ensemble Acc. (%) | Disagreement | KL | JS |
|---|---|---|---|---|---|
| ResNet20x{1}/CIFAR-10 | | | | | |
| none (dense) | 92.76 ± 0.106 | - | - | - | - |
| IMP | 91.09 ± 0.041 | 93.25 | 0.093 | 0.352 | 0.130 |
| LTH | 91.15 ± 0.163 | 91.43 | 0.035 | 0.038 | 0.011 |
| permuted | 89.38 ± 0.170 | 91.75 | 0.107 | 0.273 | 0.091 |
| naive | 88.68 ± 0.205 | 91.07 | 0.113 | 0.271 | 0.089 |
| ResNet20x{4}/CIFAR-100 | | | | | |
| none (dense) | 78.37 ± 0.059 | - | - | - | - |
| IMP | 74.46 ± 0.321 | 79.27 | 0.259 | 1.005 | 0.372 |
| LTH | 75.35 ± 0.204 | 75.99 | 0.117 | 0.134 | 0.038 |
| permuted | 72.48 ± 0.356 | 77.85 | 0.278 | 0.918 | 0.327 |
| naive | 71.05 ± 0.366 | 76.15 | 0.290 | 0.970 | 0.348 |

A limitation of the LTH is that it consistently converges to solutions very similar to the original pruned model, and this has been attributed to the LTH always training from the same initialization/rewind point, effectively relearning the same solution. Our hypothesis is that permuted LTH masks, trained with distinct initialization/rewind points and subject to approximation errors in permutation matching, may learn more diverse functions than the LTH itself.

We analyze the diversity of sparse models trained at 90% sparsity, with either a permuted mask (permuted), the LTH mask applied to a new initialization (naive), the LTH mask and initialization (LTH), or the original pruned solution (IMP) on which the LTH is based. We follow the same analysis as prior work and compare the diversity of the resulting models, over five different training runs, using the disagreement score, KL divergence, and JS divergence. We also compare with an ensemble of five models trained independently with different random seeds. As shown in Table 1, an ensemble of permuted models shows higher diversity across all the metrics than the LTH, showing that the permuted models learn a more diverse set of solutions. We provide additional details in the appendix.
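
For concreteness, here is a minimal NumPy sketch (ours; the paper’s exact averaging conventions may differ) of how disagreement, KL divergence, and JS divergence can be computed from the softmax outputs of two models on the same test set.

```python
import numpy as np

def diversity_metrics(p1, p2, eps=1e-12):
    """p1, p2: (num_examples, num_classes) softmax outputs of two models."""
    disagreement = np.mean(p1.argmax(1) != p2.argmax(1))     # differing predictions
    kl = np.mean(np.sum(p1 * np.log((p1 + eps) / (p2 + eps)), axis=1))
    m = 0.5 * (p1 + p2)                                      # mixture for JS
    js = 0.5 * np.mean(np.sum(p1 * np.log((p1 + eps) / (m + eps)), axis=1)) \
       + 0.5 * np.mean(np.sum(p2 * np.log((p2 + eps) / (m + eps)), axis=1))
    return disagreement, kl, js

# Toy usage with random 10-class "predictions" from two models.
rng = np.random.default_rng(0)
p1 = rng.dirichlet(np.ones(10), size=1000)
p2 = rng.dirichlet(np.ones(10), size=1000)
print(diversity_metrics(p1, p2))
```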


Key Insights Summarized

This investigation into aligning sparse masks reveals:

  1. Symmetry is the Culprit: The failure of LTH masks to transfer to new initializations is not a fundamental flaw of sparsity itself, but a consequence of naively re-using a mask, and the resulting misalignment between the basin a new initialization falls into and the mask’s original basin.
  2. Permutation is the Solution: By explicitly finding and correcting for this misalignment, i.e. by permuting the sparse mask, we can successfully reuse a sparse mask derived from pruning to train a high-performing sparse model from a completely new random initialization.
  3. Diversity is a Feature: This approach not only solves a practical problem with LTH but also opens the door to finding more functionally diverse sparse solutions than the LTH alone.
  4. Wider Models Align Better: The effectiveness of the permutation alignment increases with model width and sparsity, suggesting that wider models provide a richer structure for permutation matching to work effectively, and highlighting the interplay between model architecture and optimization geometry.

Conclusion and Future Directions

This work provides a new lens through which to view the sparse training problem. The success of a lottery ticket isn’t just about the mask’s structure, but also its alignment with the optimization landscape. While our method requires training two dense models to find the permutation, we also found that models could be matched earlier in training. Regardless, our analysis is a tool for insight rather than efficiency, and it proves a crucial point: it is possible to better align lottery ticket masks with new initializations.

This opens up exciting new questions. By showing that the barriers of sparse training can be overcome by understanding its underlying geometry, we hope to spur future work that makes training sparse models from scratch a practical and powerful reality.


Citing our work

If you find this work useful, please consider citing it using the following BibTeX entry:

@inproceedings{adnan2025sparse,
  author = {Adnan, Mohammed and Jain, Rohan and Sharma, Ekansh and Krishnan, Rahul G. and Ioannou, Yani},
  title = {Sparse Training from Random Initialization: Aligning Lottery Ticket Masks using Weight Symmetry},
  year = {2025},
  booktitle = {Proceedings of the Forty-Second International Conference on Machine Learning (ICML)},
  venue = {Vancouver, Canada},
}