A summary of our research exploring the effects of knowledge distillation on how deep neural networks make decisions, particularly in terms of fairness and bias.
Knowledge Distillation (KD, or simply distillation) is a technique used to compress large AI models into smaller, more efficient versions. For example, DeepSeek R1, with 671 billion parameters, has been distilled into far smaller models that are cheaper to deploy.
While distillation often succeeds in maintaining overall accuracy, our recently accepted Transactions on Machine Learning Research (TMLR) paper, “What’s Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias”, asks a subtler question:
Does the distilled student model treat all groups and types of data the same way the teacher did, or does the process introduce new, potentially harmful, biases? To grasp the implications of KD, let’s first revisit some core concepts.
At their heart, neural networks are powerful function approximators. They learn a function $f$ that maps an input $\mathbf{x}$ to an output $y$ (or a probability distribution $p$ over possible outputs in classification tasks),
\[f(\mathbf{x}) = y ,\]where $f$ is the model, $\mathbf{x}$ is the input (like an image or text), and $y$ is the output (like a label or a probability distribution). The goal of training is to minimize the difference between the model’s predictions and the true labels, often using a loss function like cross-entropy.
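As a quick illustration of that training objective (a toy sketch, not code from our paper; the tensors below are made up), the cross-entropy loss can be computed in a few lines of PyTorch:

```python
import torch
import torch.nn.functional as F

# Toy batch: 4 samples, 3 classes (e.g., dog, cat, airplane).
logits = torch.randn(4, 3)           # raw model outputs for each sample
labels = torch.tensor([1, 0, 2, 1])  # ground-truth class indices

# Standard classification objective: cross-entropy between the model's
# predicted distribution (softmax of the logits) and the hard labels.
loss = F.cross_entropy(logits, labels)
print(loss.item())
```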
Trained models, especially large ones, learn much more than just how to map inputs to correct labels. They capture a rich, nuanced understanding of the data’s structure and relationships. For example, an ImageNet model doesn’t just learn to identify a “cat”; it also implicitly learns that a cat is more similar to a “dog” than to an “airplane”. Such a model might be 90% confident that an image shows a cat, 9% confident it shows a dog, and only 1% confident it shows an airplane. This richer information, beyond the direct class predictions alone, is often termed “dark knowledge”.
In classification, the raw outputs of a neural network (logits, $z$) are typically converted into probabilities using the softmax function. Knowledge distillation introduces a “temperature” parameter ($T$) into this softmax calculation:
\[p_i = \frac{\exp(z_i/T)}{\sum_j \exp(z_j/T)}.\]When $T=1$ (standard softmax), the output probabilities are often very sharp, with the correct class having a probability close to 1 and others close to 0 (a “hard” distribution). As $T$ increases, the probability distribution becomes “softer,” meaning the probabilities for incorrect classes become larger, revealing more of the teacher’s “dark knowledge” about class similarities.
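The short NumPy sketch below (with made-up logits, so the numbers will not exactly match the example that follows) shows how the same logits produce a sharp distribution at $T=1$ and a much softer one at $T=10$:

```python
import numpy as np

def softmax_with_temperature(z, T=1.0):
    """Convert logits z into probabilities, softened by temperature T."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max()              # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = [2.0, 4.3, -0.2]        # hypothetical logits for (dog, cat, airplane)
print(softmax_with_temperature(logits, T=1))   # sharp: most mass on one class
print(softmax_with_temperature(logits, T=10))  # soft: mass spread across classes
```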
For example, with a temperature of $T=1$, the softmax output for an input $\mathbf{x}$ might be a probability distribution over three classes (dog, cat, airplane):
\[\require{colorv2} \Large f(\textcolor{red}{\mathbf{x}}, T=1) = \{\textcolor{green}{0.09}, \textcolor{red}{0.9}, \textcolor{blue}{0.01}\},\]while at a higher temperature of $T=10$, the output might be less confident in its predictions:
\[\require{colorv2} \Large f(\textcolor{red}{\mathbf{x}}, T=10) = \{\textcolor{green}{0.4}, \textcolor{red}{0.5}, \textcolor{blue}{0.1}\}.\]In standard training, a student model learns by minimizing a cross-entropy loss based on the “hard” target labels. In knowledge distillation, the student additionally learns to match the teacher’s temperature-softened output distribution (the soft targets), typically by minimizing a Kullback–Leibler divergence between the student’s and the teacher’s softened probabilities (the distillation loss).
These two losses are typically combined using a weighting hyperparameter $\alpha$:
\[L_{KD} = \alpha L_{\textrm{distillation}} + (1 - \alpha) L_{\textrm{classification}},\]where $L_{\textrm{classification}}$ is the cross-entropy loss with hard labels and $L_{\textrm{distillation}}$ is the distillation loss with soft labels.
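A minimal PyTorch sketch of this combined objective, assuming the common Hinton-style formulation in which the distillation term is a KL divergence between temperature-softened distributions scaled by $T^2$ (the default values of `T` and `alpha` below are illustrative, not the settings used in the paper):

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.9):
    """Combined distillation + classification loss, L_KD."""
    # Soft targets: teacher and student distributions at temperature T.
    soft_teacher = F.softmax(teacher_logits / T, dim=-1)
    log_soft_student = F.log_softmax(student_logits / T, dim=-1)

    # Distillation term: KL divergence between the softened distributions,
    # scaled by T^2 so gradient magnitudes stay comparable across T.
    distillation = F.kl_div(log_soft_student, soft_teacher,
                            reduction="batchmean") * (T ** 2)

    # Classification term: standard cross-entropy with hard labels.
    classification = F.cross_entropy(student_logits, labels)

    return alpha * distillation + (1 - alpha) * classification
```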
While the effect of the weighting $\alpha$ on fairness has been studied in previous work, the role of the distillation temperature $T$ has received far less attention, and it is the focus of our study.
While knowledge distillation often maintains the overall generalization performance (test accuracy) of the teacher model, a natural question is whether the distilled student actually learns the same function as the teacher.
The answer is: not necessarily. Accuracy is an aggregate measure over many samples. It’s possible for the student $g(\mathbf{x})$ to learn a different function than the teacher $f(\mathbf{x})$ while still achieving similar overall accuracy.
This divergence matters because if the student learns a different function, it may also learn different algorithmic biases than the teacher, even if the original teacher model was carefully analyzed for fairness.
This concern prompted the research questions behind our work: does distillation change which classes a model performs well on, and does it change how fairly the model treats different demographic groups and individuals?
To understand which classes are affected, we compared model predictions across a dataset. We defined disagreement between two models, $f$ and $g$, for an input $\mathbf{x}_n$ using a comparison metric (CMP) similar to those used in prior work on model comparison.
This disagreement was analyzed on a per-class basis, comparing the teacher against the distilled student and the non-distilled student against the distilled student. The non-distilled student (trained from scratch on hard labels only) served as a baseline.
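As a rough illustration of this kind of analysis (a sketch that simply counts, per class, the fraction of samples on which two models’ predicted labels differ, which may not match the paper’s exact CMP definition):

```python
import numpy as np

def per_class_disagreement(preds_a, preds_b, labels, num_classes):
    """Fraction of samples in each true class where two models disagree."""
    preds_a, preds_b, labels = map(np.asarray, (preds_a, preds_b, labels))
    rates = np.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        if mask.any():
            rates[c] = np.mean(preds_a[mask] != preds_b[mask])
    return rates

# Hypothetical usage: compare teacher vs. distilled student, and
# non-distilled student vs. distilled student, class by class.
# teacher_vs_ds = per_class_disagreement(teacher_preds, ds_preds, labels, num_classes)
# nds_vs_ds     = per_class_disagreement(nds_preds, ds_preds, labels, num_classes)
```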
A more direct concern is when changes in model behavior lead to unfair outcomes for different demographic groups. Our work therefore also measures group fairness, starting with demographic parity: the requirement that a model’s positive prediction rate be the same across groups defined by a sensitive attribute.
This is often measured by the Demographic Parity Difference (DPD), where DPD=0 indicates perfect fairness under this definition.
\[DPD = \max_{a \in A} P(\hat{Y}=1 \mid A=a) - \min_{a \in A} P(\hat{Y}=1 \mid A=a).\]A stricter criterion, equalized odds, additionally requires that true positive and false positive rates be equal across groups. This is measured by the Equalized Odds Difference (EOD), where EOD=0 is ideal.
These metrics were evaluated on datasets with known demographic attributes: CelebA, Trifeature, and HateXplain, spanning both computer vision and NLP tasks.
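For concreteness, here is a minimal NumPy sketch of the two metrics for a binary classifier and a single sensitive attribute (an illustration rather than the paper’s evaluation code; libraries such as Fairlearn provide equivalent functions):

```python
import numpy as np

def demographic_parity_difference(y_pred, groups):
    """Largest gap in positive-prediction rate across groups (DPD)."""
    rates = [np.mean(y_pred[groups == g]) for g in np.unique(groups)]
    return max(rates) - min(rates)

def equalized_odds_difference(y_true, y_pred, groups):
    """Largest gap in TPR or FPR across groups (EOD).

    Assumes y_pred is binary (0/1) and every group contains both
    positive and negative examples.
    """
    tpr, fpr = [], []
    for g in np.unique(groups):
        m = groups == g
        tpr.append(np.mean(y_pred[m & (y_true == 1)]))  # true positive rate
        fpr.append(np.mean(y_pred[m & (y_true == 0)]))  # false positive rate
    return max(max(tpr) - min(tpr), max(fpr) - min(fpr))
```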
Beyond group-level fairness, our study also examines individual fairness: the principle that similar individuals should receive similar predictions, regardless of group membership.
Our research evaluates these questions empirically, beginning with class-wise bias. The table below summarizes results for two teacher/student pairs (ResNet50/ResNet18 and ViT-Base/TinyViT) across a range of distillation temperatures.
| Model | Temp | ResNet50/ResNet18 Test Top-1 Acc. (%) | #SC | #TC | ViT-Base/TinyViT Test Top-1 Acc. (%) | #SC | #TC |
|---|---|---|---|---|---|---|---|
| Teacher | - | 76.1 ± 0.13 | - | - | 81.02 ± 0.07 | - | - |
| NDS | - | 68.64 ± 0.21 | - | - | 78.68 ± 0.19 | - | - |
| DS | 2 | 68.93 ± 0.23 | 77 | 314 | 78.79 ± 0.21 | 83 | 397 |
| DS | 3 | 69.12 ± 0.18 | 113 | 265 | 78.94 ± 0.14 | 137 | 318 |
| DS | 4 | 69.57 ± 0.26 | 169 | 237 | 79.12 ± 0.23 | 186 | 253 |
| DS | 5 | 69.85 ± 0.19 | 190 | 218 | 79.51 ± 0.17 | 215 | 206 |
| DS | 6 | 69.71 ± 0.13 | 212 | 193 | 80.03 ± 0.19 | 268 | 184 |
| DS | 7 | 70.05 ± 0.18 | 295 | 174 | 79.62 ± 0.23 | 329 | 161 |
| DS | 8 | 70.28 ± 0.27 | 346 | 138 | 79.93 ± 0.12 | 365 | 127 |
| DS | 9 | 70.52 ± 0.09 | 371 | 101 | 80.16 ± 0.17 | 397 | 96 |
| DS | 10 | 70.83 ± 0.15 | 408 | 86 | 79.98 ± 0.12 | 426 | 78 |
Class-wise bias experiments were conducted across various datasets (CIFAR-10/100, SVHN, Tiny ImageNet, ImageNet) and model architectures (ResNets, ViTs).
Distillation does not affect all classes uniformly; a significant percentage of classes can experience changes in accuracy. The distillation temperature $T$ influences which model (teacher or non-distilled student) the distilled student’s biases more closely resemble: higher temperatures tend to align the student more closely with the teacher’s class-specific performance patterns.
Our study then examines group fairness: how do distillation and, in particular, the distillation temperature affect DPD and EOD?
Across all three datasets (CelebA, Trifeature, HateXplain) and for both computer vision and NLP tasks, a consistent trend emerged: distilled students exhibited smaller fairness gaps than their non-distilled counterparts, and those gaps shrank further as the distillation temperature increased, approaching the teacher’s level.
Of course, at higher temperatures the model’s output distributions become more uniform, which can cost accuracy. Our study found that while distillation at a moderately high temperature (e.g., $T=10$) can improve fairness, very high temperatures (e.g., $T>10$) can lead to a significant drop in both accuracy and fairness.
Similar to group fairness, our study also measured individual fairness for each teacher/student pair, with the results summarized in the table below; again, the measured values decrease steadily with temperature, moving from the non-distilled student’s level toward the teacher’s.
| Model | Temp | CelebA (ResNet-50 / ResNet-18) | Trifeature (ResNet-20 / LeNet-5) | HateXplain (BERT-Base / DistilBERT) |
|---|---|---|---|---|
| Teacher | – | 0.0407 | 0.016 | 0.0320 |
| NDS | – | 0.1240 | 0.0462 | 0.1078 |
| DS | 1 | 0.1130 | 0.0422 | 0.0994 |
| DS | 2 | 0.1040 | 0.0407 | 0.0985 |
| DS | 3 | 0.0908 | 0.0393 | 0.0927 |
| DS | 4 | 0.0906 | 0.0387 | 0.0882 |
| DS | 5 | 0.0886 | 0.0384 | 0.0823 |
| DS | 6 | 0.0799 | 0.0377 | 0.0768 |
| DS | 7 | 0.0753 | 0.0356 | 0.0727 |
| DS | 8 | 0.0712 | 0.0349 | 0.0689 |
| DS | 9 | 0.0701 | 0.0341 | 0.0681 |
| DS | 10 | 0.0697 | 0.0338 | 0.0654 |
Knowledge distillation is a pervasive technique, likely affecting decisions made by models we interact with daily. This research shows that distillation does more than preserve overall accuracy in a smaller model: it can change which classes the model performs well on and how fairly it treats different groups and individuals, with the distillation temperature playing a central role in those changes.
This is a critical finding, as the effect of distillation temperature on fairness had not been extensively studied before.
These findings also point to open questions about why the distillation temperature shifts class-wise bias and fairness in the way it does, and how far these trends extend to other distillation settings and model families.
Understanding these aspects will be crucial for the responsible development and deployment of distilled AI models.
If you find this work useful, please consider citing it using the following BibTeX entry:
```bibtex
@article{mohammadshahi2025leftafterdistillation,
  author     = {Mohammadshahi, Aida and Ioannou, Yani},
  title      = {What is Left After Distillation? How Knowledge Transfer Impacts Fairness and Bias},
  journal    = {Transactions on Machine Learning Research (TMLR)},
  year       = {2025},
  arxivid    = {2410.08407},
  eprint     = {2410.08407},
  eprinttype = {arXiv}
}
```
> ✨Our paper is now officially published in Transactions on Machine Learning Research (TMLR)! We explore how knowledge #distillation (KD) impacts fairness & bias in AI models, across both group and individual fairness.
>
> — Aida Mohammadshahi (@Aidamo27) April 16, 2025