A summary of our research exploring the effects of knowledge distillation on how deep neural networks make decisions, particularly in terms of fairness and bias.
Knowledge Distillation (KD, or simply distillation) is a technique used to compress large AI models into smaller, more efficient versions. For example, DeepSeek R1, with 671 billion parameters, has been distilled into far smaller models that are cheaper to deploy.
While distillation often succeeds in maintaining overall accuracy, our recently accepted Transactions on Machine Learning Research (TMLR) paper, “What’s Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias”, asks a subtler question:
Does the distilled student model treat all groups and types of data the same way the teacher did, or does the process introduce new, potentially harmful, biases? To grasp the implications of KD, let’s first revisit some core concepts.
At their heart, neural networks are powerful function approximators. They learn a function $f$ that maps an input $\mathbf{x}$ to an output $y$ (or a probability distribution $p$ over possible outputs in classification tasks),
\[f(\mathbf{x}) = y ,\]where $f$ is the model, $\mathbf{x}$ is the input (like an image or text), and $y$ is the output (like a label or a probability distribution). The goal of training is to minimize the difference between the model’s predictions and the true labels, often using a loss function like cross-entropy.
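As a quick illustration of that training objective (a toy sketch, not code from our paper; the tensors below are made up), the cross-entropy loss can be computed in a few lines of PyTorch:

```python
import torch
import torch.nn.functional as F

# Toy batch: 4 samples, 3 classes (e.g., dog, cat, airplane).
logits = torch.randn(4, 3)           # raw model outputs for each sample
labels = torch.tensor([1, 0, 2, 1])  # ground-truth class indices

# Standard classification objective: cross-entropy between the model's
# predicted distribution (softmax of the logits) and the hard labels.
loss = F.cross_entropy(logits, labels)
print(loss.item())
```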
Trained models, especially large ones, learn much more than just how to map inputs to correct labels. They capture a rich, nuanced understanding of the data’s structure and relationships. For example, an ImageNet model doesn’t just learn to identify a “cat”; it also implicitly learns that a cat is more similar to a “dog” than to an “airplane”. Such a model might be 90% confident that an image shows a cat, 9% confident it shows a dog, and only 1% confident it shows an airplane. This richer information, beyond the direct class predictions alone, is often termed “dark knowledge”.
In classification, the raw outputs of a neural network (logits, $z$) are typically converted into probabilities using the softmax function. Knowledge distillation introduces a “temperature” parameter ($T$) into this softmax calculation:
\[p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}.\]When $T=1$ (standard softmax), the output probabilities are often very sharp, with the correct class having a probability close to 1 and others close to 0 (a “hard” distribution). As $T$ increases, the probability distribution becomes “softer,” meaning the probabilities for incorrect classes become larger, revealing more of the teacher’s “dark knowledge” about class similarities.
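The short NumPy sketch below (with made-up logits, so the numbers will not exactly match the example that follows) shows how the same logits produce a sharp distribution at $T=1$ and a much softer one at $T=10$:

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Convert logits z into probabilities, softened by temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 4.3, -0.2]        # hypothetical logits for (dog, cat, airplane)
print(softmax_with_temperature(logits, T=1))   # sharp: most mass on one class
print(softmax_with_temperature(logits, T=10))  # soft: mass spread across classes
```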
For example, with a temperature of $T=1$, the softmax output for an input $\mathbf{x}$ might be a probability distribution over three classes (dog, cat, airplane):
\[\require{colorv2} \Large f(\textcolor{red}{\mathbf{x}}, T=1) = \{\textcolor{green}{0.09}, \textcolor{red}{0.9}, \textcolor{blue}{0.01}\},\]while at a higher temperature of $T=10$, the output might be less confident in its predictions:
\[\require{colorv2} \Large f(\textcolor{red}{\mathbf{x}}, T=10) = \{\textcolor{green}{0.4}, \textcolor{red}{0.5}, \textcolor{blue}{0.1}\}.\]In standard training, a student model learns by minimizing a cross-entropy loss based on the “hard” target labels. In knowledge distillation, the student additionally learns to match the teacher’s temperature-softened output distribution (the soft targets), typically by minimizing a Kullback–Leibler divergence between the student’s and the teacher’s softened probabilities (the distillation loss).
These two losses are typically combined using a weighting hyperparameter $\alpha$:
\[L_{KD} = \alpha L_{\textrm{distillation}} + (1 - \alpha) L_{\textrm{classification}},\]where $L_{\textrm{classification}}$ is the cross-entropy loss with hard labels and $L_{\textrm{distillation}}$ is the distillation loss with soft labels.
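A minimal PyTorch sketch of this combined objective, assuming the common Hinton-style formulation in which the distillation term is a KL divergence between temperature-softened distributions scaled by $T^2$ (the default values of `T` and `alpha` below are illustrative, not the settings used in the paper):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Combined distillation + classification loss, L_KD."""
    # Soft targets: teacher and student distributions at temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)

    # Distillation term: KL divergence between the softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across T.
    distillation = F.kl_div(log_soft_student, soft_teacher,
                            reduction="batchmean") * (T ** 2)

    # Classification term: standard cross-entropy with hard labels.
    classification = F.cross_entropy(student_logits, labels)

    return alpha * distillation + (1 - alpha) * classification
```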
While the effect of the weighting $\alpha$ on fairness has been studied in previous work, the role of the distillation temperature $T$ has received far less attention, and it is the focus of our study.
While knowledge distillation often maintains the overall generalization performance (test accuracy) of the teacher model, a natural question is whether the distilled student actually learns the same function as the teacher.
The answer is: not necessarily. Accuracy is an aggregate measure over many samples. It’s possible for the student $g(\mathbf{x})$ to learn a different function than the teacher $f(\mathbf{x})$ while still achieving similar overall accuracy.
This divergence matters because if the student learns a different function, it may also learn different algorithmic biases than the teacher, even if the original teacher model was carefully analyzed for fairness.
This concern prompted the research questions behind our work: does distillation change which classes a model performs well on, and does it change how fairly the model treats different demographic groups and individuals?
To understand which classes are affected, we compared model predictions across a dataset. We defined disagreement between two models, $f$ and $g$, for an input $\mathbf{x}_n$ using a comparison metric (CMP) similar to those used in prior work on model comparison.
This disagreement was analyzed on a per-class basis, comparing the teacher against the distilled student and the non-distilled student against the distilled student. The non-distilled student (trained from scratch on hard labels only) served as a baseline.
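As a rough illustration of this kind of analysis (a sketch that simply counts, per class, the fraction of samples on which two models’ predicted labels differ, which may not match the paper’s exact CMP definition):

```python
import numpy as np

def per_class_disagreement(preds_a, preds_b, labels, num_classes):
    """Fraction of samples in each true class where two models disagree."""
    preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))
    rates = np.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            rates[c] = np.mean(preds_a[mask] != preds_b[mask])
    return rates

# Hypothetical usage: compare teacher vs. distilled student, and
# non-distilled student vs. distilled student, class by class.
# teacher_vs_ds = per_class_disagreement(teacher_preds, ds_preds, labels, num_classes)
# nds_vs_ds     = per_class_disagreement(nds_preds, ds_preds, labels, num_classes)
```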
A more direct concern is when changes in model behavior lead to unfair outcomes for different demographic groups. Our work therefore also measures group fairness, starting with demographic parity: the requirement that a model’s positive prediction rate be the same across groups defined by a sensitive attribute.
This is often measured by the Demographic Parity Difference (DPD), where DPD=0 indicates perfect fairness under this definition.
\[DPD = \max_{a \in A} P(\hat{Y}=1 \mid A=a) - \min_{a \in A} P(\hat{Y}=1 \mid A=a).\]A stricter criterion, equalized odds, additionally requires that true positive and false positive rates be equal across groups. This is measured by the Equalized Odds Difference (EOD), where EOD=0 is ideal.
These metrics were evaluated on datasets with known demographic attributes: CelebA, Trifeature, and HateXplain, spanning both computer vision and NLP tasks.
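For concreteness, here is a minimal NumPy sketch of the two metrics for a binary classifier and a single sensitive attribute (an illustration rather than the paper’s evaluation code; libraries such as Fairlearn provide equivalent functions):

```python
import numpy as np

def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate across groups (DPD)."""
    rates = [np.mean(y_pred[groups == g]) for g in np.unique(groups)]
    return max(rates) - min(rates)

def equalized_odds_difference(y_true, y_pred, groups):
    """Largest gap in TPR or FPR across groups (EOD).

    Assumes y_pred is binary (0/1) and every group contains both
    positive and negative examples.
    """
    tpr, fpr = [], []
    for g in np.unique(groups):
        m = groups == g
        tpr.append(np.mean(y_pred[m & (y_true == 1)]))  # true positive rate
        fpr.append(np.mean(y_pred[m & (y_true == 0)]))  # false positive rate
    return max(max(tpr) - min(tpr), max(fpr) - min(fpr))
```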
Beyond group-level fairness, our study also examines individual fairness: the principle that similar individuals should receive similar predictions, regardless of group membership.
Our research evaluates these questions empirically, beginning with class-wise bias. The table below summarizes results for two teacher/student pairs (ResNet50/ResNet18 and ViT-Base/TinyViT) across a range of distillation temperatures.
| Model | Temp | ResNet50/ResNet18 Test Top-1 Acc. (%) | #SC | #TC | ViT-Base/TinyViT Test Top-1 Acc. (%) | #SC | #TC |
|---|---|---|---|---|---|---|---|
| Teacher | - | 76.1 ± 0.13 | - | - | 81.02 ± 0.07 | - | - |
| NDS | - | 68.64 ± 0.21 | - | - | 78.68 ± 0.19 | - | - |
| DS | 2 | 68.93 ± 0.23 | 77 | 314 | 78.79 ± 0.21 | 83 | 397 |
| DS | 3 | 69.12 ± 0.18 | 113 | 265 | 78.94 ± 0.14 | 137 | 318 |
| DS | 4 | 69.57 ± 0.26 | 169 | 237 | 79.12 ± 0.23 | 186 | 253 |
| DS | 5 | 69.85 ± 0.19 | 190 | 218 | 79.51 ± 0.17 | 215 | 206 |
| DS | 6 | 69.71 ± 0.13 | 212 | 193 | 80.03 ± 0.19 | 268 | 184 |
| DS | 7 | 70.05 ± 0.18 | 295 | 174 | 79.62 ± 0.23 | 329 | 161 |
| DS | 8 | 70.28 ± 0.27 | 346 | 138 | 79.93 ± 0.12 | 365 | 127 |
| DS | 9 | 70.52 ± 0.09 | 371 | 101 | 80.16 ± 0.17 | 397 | 96 |
| DS | 10 | 70.83 ± 0.15 | 408 | 86 | 79.98 ± 0.12 | 426 | 78 |
Class-wise bias experiments were conducted across various datasets (CIFAR-10/100, SVHN, Tiny ImageNet, ImageNet) and model architectures (ResNets, ViTs).
Distillation does not affect all classes uniformly; a significant percentage of classes can experience changes in accuracy. The distillation temperature $T$ influences which model (teacher or non-distilled student) the distilled student’s biases more closely resemble: higher temperatures tend to align the student more closely with the teacher’s class-specific performance patterns.
Our study then examines group fairness: how do distillation and, in particular, the distillation temperature affect DPD and EOD?
Across all three datasets (CelebA, Trifeature, HateXplain) and for both computer vision and NLP tasks, a consistent trend emerged: distilled students exhibited smaller fairness gaps than their non-distilled counterparts, and those gaps shrank further as the distillation temperature increased, approaching the teacher’s level.
Of course, at higher temperatures the model’s output distributions become more uniform, which can cost accuracy. Our study found that while distillation at a moderately high temperature (e.g., $T=10$) can improve fairness, very high temperatures (e.g., $T>10$) can lead to a significant drop in both accuracy and fairness.
Similar to group fairness, our study also measured individual fairness for each teacher/student pair, with the results summarized in the table below; again, the measured values decrease steadily with temperature, moving from the non-distilled student’s level toward the teacher’s.
| Model | Temp | CelebA (ResNet-50 / ResNet-18) | Trifeature (ResNet-20 / LeNet-5) | HateXplain (BERT-Base / DistilBERT) |
|---|---|---|---|---|
| Teacher | – | 0.0407 | 0.016 | 0.0320 |
| NDS | – | 0.1240 | 0.0462 | 0.1078 |
| DS | 1 | 0.1130 | 0.0422 | 0.0994 |
| DS | 2 | 0.1040 | 0.0407 | 0.0985 |
| DS | 3 | 0.0908 | 0.0393 | 0.0927 |
| DS | 4 | 0.0906 | 0.0387 | 0.0882 |
| DS | 5 | 0.0886 | 0.0384 | 0.0823 |
| DS | 6 | 0.0799 | 0.0377 | 0.0768 |
| DS | 7 | 0.0753 | 0.0356 | 0.0727 |
| DS | 8 | 0.0712 | 0.0349 | 0.0689 |
| DS | 9 | 0.0701 | 0.0341 | 0.0681 |
| DS | 10 | 0.0697 | 0.0338 | 0.0654 |
Knowledge distillation is a pervasive technique, likely affecting decisions made by models we interact with daily. This research shows that distillation does more than preserve overall accuracy in a smaller model: it can change which classes the model performs well on and how fairly it treats different groups and individuals, with the distillation temperature playing a central role in those changes.
This is a critical finding, as the effect of distillation temperature on fairness had not been extensively studied before.
These findings also point to open questions about why the distillation temperature shifts class-wise bias and fairness in the way it does, and how far these trends extend to other distillation settings and model families.
Understanding these aspects will be crucial for the responsible development and deployment of distilled AI models.
If you find this work useful, please consider citing it using the following BibTeX entry:
```bibtex
@article{mohammadshahi2025leftafterdistillation,
  author     = {Mohammadshahi, Aida and Ioannou, Yani},
  title      = {What is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias},
  journal    = {Transactions on Machine Learning Research (TMLR)},
  year       = {2025},
  arxivid    = {2410.08407},
  eprint     = {2410.08407},
  eprinttype = {arXiv}
}
```
> ✨Our paper is now officially published in Transactions on Machine Learning Research (TMLR)! We explore how knowledge #distillation (KD) impacts fairness & bias in AI models, across both group and individual fairness.
>
> — Aida Mohammadshahi (@Aidamo27) April 16, 2025