Beyond Compression: How Knowledge Distillation Impacts Fairness and Bias in AI Models

A summary of our research exploring the effects of knowledge distillation on how deep neural networks make decisions, particularly in terms of fairness and bias.

TL;DR

Knowledge Distillation (or distillation) is a technique used to compress large AI models into smaller, more efficient versions. For example, DeepSeek R1, with 671 billion parameters, was distilled into smaller, more manageable versions that are easier to deploy in real-world applications.

While distillation often succeeds in maintaining overall accuracy, our recently accepted Transactions on Machine Learning Research (TMLR) paper, “What’s Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias”, explores how the distillation process affects model decisions, particularly in terms of fairness and bias. We found that:

  1. Distillation does not affect all classes uniformly: a significant fraction of classes experience statistically significant changes in accuracy.
  2. The distillation temperature $T$ shapes which model the student’s class-level biases resemble, with higher temperatures aligning the student more closely with the teacher.
  3. Moderately high temperatures improve both group fairness (lower DPD/EOD) and individual fairness, while very high temperatures can hurt accuracy and fairness.

Introduction: Knowledge Distillation

Large models, like DeepSeek R1 with 671 billion parameters, are often distilled into smaller, more manageable versions (e.g., 1.5B to 70B Llama models) that are easier to deploy in real-world applications. This process, known as Knowledge Distillation (or just distillation), aims to transfer the “knowledge” from a large “teacher” model to a smaller “student” model, often preserving overall performance such as test accuracy.

While distillation often succeeds in maintaining overall accuracy, our recently accepted Transactions on Machine Learning Research (TMLR) paper, “What’s Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias”, takes a deeper dive into understanding how distillation affects the decisions made by a model, through the lens of fairness and bias. This is particularly important as AI systems are increasingly used in sensitive areas like hiring, loan applications, and medical diagnosis, where fairness is crucial.

Does the distilled student model treat all groups and types of data the same way the teacher did, or does the process introduce new, potentially harmful, biases? To grasp the implications of knowledge distillation (KD), let’s first revisit some core concepts.

Understanding Knowledge Distillation

Neural Networks as Function Approximators

At their heart, neural networks are powerful function approximators. They learn a function $f$ that maps an input $\mathbf{x}$ to an output $y$ (or a probability distribution $p$ over possible outputs in classification tasks),

\[f(\mathbf{x}) = y ,\]

where $f$ is the model, $\mathbf{x}$ is the input (like an image or text), and $y$ is the output (like a label or a probability distribution). The goal of training is to minimize the difference between the model’s predictions and the true labels, often using a loss function like cross-entropy.
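
As a minimal illustration (our own sketch, not code from the paper), this is what standard hard-label training looks like in PyTorch: the model produces logits, and cross-entropy measures the gap between its predictions and the true labels.

```python
import torch
import torch.nn as nn

# A toy classifier f: maps a flattened 32x32 RGB "image" to logits over 3 classes.
model = nn.Sequential(nn.Flatten(), nn.Linear(32 * 32 * 3, 128), nn.ReLU(), nn.Linear(128, 3))
loss_fn = nn.CrossEntropyLoss()                    # standard hard-label objective
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3, 32, 32)                      # a batch of 8 random "images"
y = torch.randint(0, 3, (8,))                      # hard labels, e.g. dog / cat / airplane

logits = model(x)                                  # f(x): raw class scores
loss = loss_fn(logits, y)                          # mismatch between predictions and labels
loss.backward()                                    # one training step
optimizer.step()
```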

The Concept of “Dark Knowledge”

\[\require{colorv2} \Large f(\textcolor{red}{\mathbf{x} \textrm{: image of cat}}) = \{ \textcolor{green}{\textrm{dog: } 0.09}, \textcolor{red}{\textrm{cat: } 0.9}, \textcolor{blue}{\textrm{airplane: } 0.01}\}\]

Trained models, especially large ones, learn much more than just how to map inputs to correct labels. They capture a rich, nuanced understanding of the data’s structure and relationships. For example, an ImageNet model doesn’t just learn to identify a “cat”; it also implicitly learns that a cat is more similar to a “dog” than to an “airplane”. In the example above, the model is 90% confident the image is of a cat, 9% confident it is of a dog, and only 1% confident it is of an airplane. This richer information, beyond the direct class predictions alone, is often termed “dark knowledge”.

The Role of Temperature in Softmax

In classification, the raw outputs of a neural network (logits, $z$) are typically converted into probabilities using the softmax function. Knowledge distillation introduces a “temperature” parameter ($T$) into this softmax calculation:

\[p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}.\]

When $T=1$ (standard softmax), the output probabilities are often very sharp, with the correct class having a probability close to 1 and others close to 0 (a “hard” distribution). As $T$ increases, the probability distribution becomes “softer,” meaning the probabilities for incorrect classes become larger, revealing more of the teacher’s “dark knowledge” about class similarities.

For example, with a temperature of $T=1$, the softmax output for an input $\mathbf{x}$ might be a probability distribution over three classes (dog, cat, airplane):

\[\require{colorv2} \Large f(\textcolor{red}{\mathbf{x}}, T=1) = \{\textcolor{green}{0.09}, \textcolor{red}{0.9}, \textcolor{blue}{0.01}\},\]

while at a higher temperature of $T=10$, the output might be less confident in its predictions:

\[\require{colorv2} \Large f(\textcolor{red}{\mathbf{x}}, T=10) = \{\textcolor{green}{0.4}, \textcolor{red}{0.5}, \textcolor{blue}{0.1}\}.\]
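
Numerically, temperature scaling simply divides the logits before the softmax. Here is a small sketch; the logits below are made up purely to roughly reproduce the example above.

```python
import torch
import torch.nn.functional as F

# Hypothetical logits for (dog, cat, airplane); chosen only to illustrate the effect.
logits = torch.tensor([2.2, 4.5, 0.0])

def softened(z, T):
    """Softmax with temperature T: higher T yields a flatter ("softer") distribution."""
    return F.softmax(z / T, dim=-1)

print(softened(logits, T=1))   # sharp, roughly [0.09, 0.90, 0.01]
print(softened(logits, T=10))  # much flatter, exposing class similarities
```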

The Distillation Process

In standard training, a student model learns by minimizing a cross-entropy loss based on the “hard” target labels. In knowledge distillation, the student learns from two sources:

  1. The cross-entropy loss with the ground truth (“hard”) labels.
  2. A distillation loss (often Kullback-Leibler divergence) that encourages the student’s “soft” predictions (obtained using a higher temperature $T$) to match the teacher’s “soft” predictions (also obtained using temperature $T$).

These two losses are typically combined using a weighting hyperparameter $\alpha$:

\[L_{KD} = \alpha L_{\textrm{distillation}} + (1 - \alpha) L_{\textrm{classification}},\]

where $L_{\textrm{classification}}$ is the cross-entropy loss with hard labels and $L_{\textrm{distillation}}$ is the distillation loss with soft labels.
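
A minimal sketch of this combined objective in PyTorch (a hedged illustration with our own function and parameter names, not the paper’s code):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """L_KD = alpha * L_distillation + (1 - alpha) * L_classification."""
    # Soft targets: teacher and student probabilities at temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)

    # KL divergence between the softened distributions.
    # (Many implementations also scale this term by T**2 to balance gradient magnitudes.)
    l_distill = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean")

    # Standard cross-entropy with the hard ground-truth labels.
    l_classify = F.cross_entropy(student_logits, labels)

    return alpha * l_distill + (1 - alpha) * l_classify

# Toy usage with random logits and labels; in practice the teacher logits come
# from a frozen, pre-trained teacher model.
s = torch.randn(8, 10)
t = torch.randn(8, 10)
y = torch.randint(0, 10, (8,))
loss = kd_loss(s, t, y, T=4.0, alpha=0.9)
```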

While previous work has studied the effect of $\alpha$ on fairness, this work focuses on the effect of the distillation temperature $T$ on bias and fairness.

Beyond Accuracy: Does the Student Learn the Same Function?

While knowledge distillation often maintains the overall generalization performance (test accuracy) of the teacher model, a crucial question arises: does this mean the student model has learned approximately the same function as the teacher?

The answer is: not necessarily. Accuracy is an aggregate measure over many samples. It’s possible for the student $g(\mathbf{x})$ to learn a different function than the teacher $f(\mathbf{x})$ while still achieving similar overall accuracy.

This divergence matters because if the student learns a different function, it may also learn different algorithmic biases than the teacher, even if the original teacher model was carefully analyzed for fairness.

Research Deep Dive: Unpacking the Impact of Distillation

This concern prompted the research questions behind our work:

Research Questions

  1. Which specific classes are significantly affected by the distillation process in terms of their accuracy?
  2. How does varying the distillation temperature ($T$) impact the class-level biases of the student model?
  3. What is the effect of distillation temperature on group fairness (ensuring equitable outcomes across different demographic groups)?
  4. How does distillation temperature influence individual fairness (ensuring similar individuals receive similar predictions)?

Analyzing Class-wise Bias

Figure: To better understand the effect of knowledge distillation, and to control for the effect of model size on bias/fairness, we compared a Distilled Student (DS) to a Non-Distilled Student (NDS), i.e., a student trained via distillation from a teacher versus a student model trained from random initialization on the same dataset.

To understand which classes are affected, we compared model predictions across a dataset. We defined disagreement between two models, $f$ and $g$, for an input $\mathbf{x}_n$ using a comparison metric (CMP), similar to approaches in prior work:

\[CMP(f(\mathbf{x}_n), g(\mathbf{x}_n)) = \begin{cases} 0 & \text{if } f(\mathbf{x}_n) = g(\mathbf{x}_n) \\ 1 & \text{if } f(\mathbf{x}_n) \neq g(\mathbf{x}_n) \end{cases}\]

This disagreement was analyzed on a per-class basis for two pairs of models: (teacher vs. distilled student) and (non-distilled student vs. distilled student). The non-distilled student (trained from scratch on hard labels) served as a baseline.
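
A small sketch (ours, not the paper’s code) of computing this disagreement per class; grouping by the ground-truth class is one reasonable choice we assume here:

```python
import numpy as np

def classwise_disagreement(preds_f, preds_g, labels, num_classes):
    """Fraction of samples in each ground-truth class where models f and g disagree (CMP = 1)."""
    preds_f, preds_g, labels = map(np.asarray, (preds_f, preds_g, labels))
    disagree = (preds_f != preds_g).astype(float)  # CMP(f(x_n), g(x_n)) per sample
    return np.array([disagree[labels == c].mean() if np.any(labels == c) else 0.0
                     for c in range(num_classes)])

# Toy example: predicted class indices from a teacher and a distilled student.
teacher_preds = np.array([0, 1, 2, 2, 1, 0])
student_preds = np.array([0, 2, 2, 1, 1, 0])
true_labels   = np.array([0, 1, 2, 2, 1, 0])
print(classwise_disagreement(teacher_preds, student_preds, true_labels, num_classes=3))
```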

Probing Group Fairness

Figure: Group fairness metrics used in our analysis: Demographic Parity and Equalized Odds.

A more direct concern is when changes in model behavior lead to unfair outcomes for different demographic groups. The research investigated two standard group fairness notions:

Demographic Parity requires that the rate of positive predictions is the same across sensitive groups:

\[P(\hat{Y}=1 | A=a) = P(\hat{Y}=1 | A=b).\]

This is often measured by the Demographic Parity Difference (DPD), where DPD = 0 indicates perfect fairness under this definition:

\[DPD = \max_{a \in A} P(\hat{Y}=1 | A=a) - \min_{a \in A} P(\hat{Y}=1 | A=a).\]

Equalized Odds requires that predictions, conditioned on the true label, are independent of the sensitive attribute, i.e., equal true positive and false positive rates across groups:

\[P(\hat{Y}=1 | Y=y, A=a) = P(\hat{Y}=1 | Y=y, A=b).\]

This is measured by the Equalized Odds Difference (EOD), where EOD = 0 is ideal.

These metrics were evaluated on datasets with known demographic or group attributes: CelebA and Trifeature for computer vision, and HateXplain for NLP.
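
As an illustrative sketch (not the paper’s evaluation code), DPD and EOD can be computed directly from binary predictions, true labels, and a sensitive attribute:

```python
import numpy as np

def demographic_parity_difference(y_pred, sensitive):
    """DPD: largest gap in positive prediction rate across groups."""
    rates = [y_pred[sensitive == g].mean() for g in np.unique(sensitive)]
    return max(rates) - min(rates)

def equalized_odds_difference(y_pred, y_true, sensitive):
    """EOD: worst-case gap in true positive rate or false positive rate across groups."""
    gaps = []
    for y in (0, 1):  # condition on the true label
        rates = [y_pred[(sensitive == g) & (y_true == y)].mean()
                 for g in np.unique(sensitive)]
        gaps.append(max(rates) - min(rates))
    return max(gaps)

# Toy usage with binary predictions and a binary sensitive attribute.
y_pred    = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_true    = np.array([1, 0, 1, 0, 0, 1, 1, 0])
sensitive = np.array([0, 0, 0, 0, 1, 1, 1, 1])
print(demographic_parity_difference(y_pred, sensitive))       # 0.5
print(equalized_odds_difference(y_pred, y_true, sensitive))   # 0.5
```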

Investigating Individual Fairness

Beyond group-level fairness, our study also examined individual fairness: the principle that similar individuals should receive similar outcomes. This was quantified using a metric based on the Lipschitz condition proposed by Dwork et al., where smaller values indicate better individual fairness.
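
The paper’s exact formulation is not reproduced here, but as a rough sketch of the idea, a Lipschitz-style individual fairness score can be estimated by comparing output distances to input distances over pairs of samples (the pairing strategy and distance choices below are our assumptions):

```python
import numpy as np

def individual_fairness_score(inputs, outputs, eps=1e-8):
    """Average ratio of output distance to input distance over sample pairs;
    smaller values mean similar inputs receive more similar predictions."""
    n = len(inputs)
    ratios = []
    for i in range(n):
        for j in range(i + 1, n):
            d_in = np.linalg.norm(inputs[i] - inputs[j]) + eps
            d_out = np.linalg.norm(outputs[i] - outputs[j])
            ratios.append(d_out / d_in)
    return float(np.mean(ratios))

# Toy usage: input feature vectors and the model's predicted probability vectors.
X = np.random.rand(16, 8)              # inputs (e.g., embeddings of individuals)
P = np.random.rand(16, 3)
P = P / P.sum(axis=1, keepdims=True)   # predicted class probabilities
print(individual_fairness_score(X, P))
```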

Key Findings and Insights

Our research yielded several important findings regarding the interplay of knowledge distillation, temperature, and fairness.

Class-wise Bias: An Uneven Impact

Table: Class-wise Bias and Distillation. The number of statistically significantly affected classes, comparing the class-wise accuracy of *teacher vs. Distilled Student (DS) models*, denoted #TC, and *Non-Distilled Student (NDS) vs. Distilled Student models*, denoted #SC, for the ImageNet dataset.

| Model | Temp | ResNet50/ResNet18 Top-1 Acc. (%) | #SC | #TC | ViT-Base/TinyViT Top-1 Acc. (%) | #SC | #TC |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Teacher | - | 76.1 ± 0.13 | - | - | 81.02 ± 0.07 | - | - |
| NDS | - | 68.64 ± 0.21 | - | - | 78.68 ± 0.19 | - | - |
| DS | 2 | 68.93 ± 0.23 | 77 | 314 | 78.79 ± 0.21 | 83 | 397 |
| DS | 3 | 69.12 ± 0.18 | 113 | 265 | 78.94 ± 0.14 | 137 | 318 |
| DS | 4 | 69.57 ± 0.26 | 169 | 237 | 79.12 ± 0.23 | 186 | 253 |
| DS | 5 | 69.85 ± 0.19 | 190 | 218 | 79.51 ± 0.17 | 215 | 206 |
| DS | 6 | 69.71 ± 0.13 | 212 | 193 | 80.03 ± 0.19 | 268 | 184 |
| DS | 7 | 70.05 ± 0.18 | 295 | 174 | 79.62 ± 0.23 | 329 | 161 |
| DS | 8 | 70.28 ± 0.27 | 346 | 138 | 79.93 ± 0.12 | 365 | 127 |
| DS | 9 | 70.52 ± 0.09 | 371 | 101 | 80.16 ± 0.17 | 397 | 96 |
| DS | 10 | 70.83 ± 0.15 | 408 | 86 | 79.98 ± 0.12 | 426 | 78 |
Figure: Class-wise Disagreement. Disagreement between a ResNet-56 teacher and a ResNet-20 (left) non-distilled / (right) distilled student for (a) CIFAR-10 using $T=9$ and (b) SVHN using $T=7$. The diagonals are excluded since there both models predict the same class and there is no disagreement.

Class-wise bias experiments were conducted across various datasets (CIFAR-10/100, SVHN, Tiny ImageNet, ImageNet) and model architectures (ResNets, ViTs). In order to understand the effect of distillation on a student model, we compared the distilled student model to both the teacher model and a non-distilled student model (trained from scratch on hard labels).

Distillation does not affect all classes uniformly; a significant percentage of classes can experience changes in accuracy. The distillation temperature $T$ influences which model (teacher or non-distilled student) the distilled student’s biases more closely resemble. Higher temperatures tend to align the student more with the teacher’s class-specific performance patterns.

A change in class bias by itself isn’t inherently good or bad; its implications depend on the application context. This motivated our analysis of the impact on decisions, i.e., group and individual fairness.

Group Fairness: Temperature Matters

Figure: Combined graphs showing EOD/DPD decreasing with increasing temperature for CelebA image dataset.
Figure: Combined graphs showing EOD/DPD decreasing with increasing temperature for the HateXplain language dataset.

Across all three datasets (CelebA, Trifeature, HateXplain) and for both computer vision and NLP tasks, a consistent trend emerged: increasing the distillation temperature reduced both the Demographic Parity Difference (DPD) and the Equalized Odds Difference (EOD), i.e., higher temperatures led to more equitable outcomes across groups.

Very High Temperatures

Figure: Combined/representative graphs showing EOD/DPD decreasing with very high temperatures for HateXplain.

Of course, at very high temperatures the model’s predictions become more uniform, which can lead to a loss of accuracy. Our study found that while distillation at a moderately high temperature (e.g., $T=10$) can lead to improved fairness, very high temperatures (e.g., $T>10$) can lead to a significant drop in both accuracy and fairness.

Individual Fairness: Consistency Improves

Similar to group fairness, our study found a clear improvement in individual fairness with increased distillation temperature across the tested datasets. This suggests that higher temperatures not only help in equitable group outcomes but also in making the model’s predictions more consistent for similar inputs.

Table: Individual Fairness Metrics Across Datasets. Individual fairness scores (smaller is better) for teacher, Non-Distilled Student (NDS), and Distilled Student (DS) models across the CelebA, Trifeature, and HateXplain datasets. For DS models, scores are reported for varying temperature values $T$.

| Model | Temp | CelebA (ResNet-50 / ResNet-18) | Trifeature (ResNet-20 / LeNet-5) | HateXplain (BERT-Base / DistilBERT) |
| --- | --- | --- | --- | --- |
| Teacher | - | 0.0407 | 0.016 | 0.0320 |
| NDS | - | 0.1240 | 0.0462 | 0.1078 |
| DS | 1 | 0.1130 | 0.0422 | 0.0994 |
| DS | 2 | 0.1040 | 0.0407 | 0.0985 |
| DS | 3 | 0.0908 | 0.0393 | 0.0927 |
| DS | 4 | 0.0906 | 0.0387 | 0.0882 |
| DS | 5 | 0.0886 | 0.0384 | 0.0823 |
| DS | 6 | 0.0799 | 0.0377 | 0.0768 |
| DS | 7 | 0.0753 | 0.0356 | 0.0727 |
| DS | 8 | 0.0712 | 0.0349 | 0.0689 |
| DS | 9 | 0.0701 | 0.0341 | 0.0681 |
| DS | 10 | 0.0697 | 0.0338 | 0.0654 |

Conclusion: Distillation, A Double-Edged Sword?

Knowledge distillation is a pervasive technique, likely affecting decisions made by models we interact with daily. This research highlights that while KD is valuable for model compression, its effects are more nuanced than simply preserving accuracy.

This is a critical finding, as the effect of distillation temperature on fairness had not been extensively studied before.

Future Directions

These findings open up several avenues for future investigation:

Understanding these aspects will be crucial for the responsible development and deployment of distilled AI models.

Citing our work

If you find this work useful, please consider citing it using the following BibTeX entry:

@article{mohammadshahi2025leftafterdistillation,
  author = {Mohammadshahi, Aida and Ioannou, Yani},
  title = {What is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias},
  journal = {Transactions on Machine Learning Research (TMLR)},
  year = {2025},
  arxivid = {2410.08407},
  eprint = {2410.08407},
  eprinttype = {arXiv}
}