"

4.4 Mitigating Poisoning Attacks

This section presents key strategies for mitigating poisoning attacks. Defence mechanisms operate in two stages: before deployment (during training) and after deployment (during testing).

Recall

There are four main categories of defences, which align with indiscriminate and targeted poisoning attacks:

  1. Training Data Sanitization – Identifies and removes potentially harmful training samples before model training.
  2. Robust Training – Modifies the training process to reduce the impact of adversarial data points.
  3. Model Inspection – Detects whether a model has been compromised, such as through a backdoor attack.
  4. Model Sanitization – Cleans the model to eliminate backdoors or targeted poisoning attempts.

Training Data Sanitization

This defence strategy focuses on detecting and eliminating poisoned samples before training, thereby reducing the impact of adversarial attacks. The key idea is that for a poisoning attack to be effective, the manipulated samples must differ from the rest of the training data; otherwise, they would not influence the model. Since poisoning samples often exhibit outlier behaviour relative to the distribution of legitimate training data, they can be identified and removed.
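For intuition, the following is a minimal sketch of this idea in Python (using NumPy): samples that lie far from their class centroid are treated as suspicious and dropped. The function name, the distance measure, and the quantile threshold are illustrative choices, not taken from any specific cited defence.

```python
import numpy as np

def sanitize_by_centroid_distance(X, y, quantile=0.95):
    """Drop training points that lie unusually far from their class centroid."""
    keep = np.ones(len(y), dtype=bool)
    for label in np.unique(y):
        idx = np.where(y == label)[0]
        centroid = X[idx].mean(axis=0)                     # per-class centroid
        dists = np.linalg.norm(X[idx] - centroid, axis=1)  # distance of each sample to it
        cutoff = np.quantile(dists, quantile)              # keep the closest fraction of samples
        keep[idx[dists > cutoff]] = False                  # flag the farthest samples as outliers
    return X[keep], y[keep]
```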

Review Images

Review the image from Poisoning Machine Learning: Attacks and Defenses by Battista Biggio.

Defences in this category require access to the training dataset and, in some cases, a clean validation dataset that helps detect anomalous poisoning samples. However, these approaches do not necessitate modifications to the learning algorithm or adjustments to model parameters, making them applicable across various learning settings. Nevertheless, there is always a risk that an attacker could manipulate the training data before it reaches the defender, which is beyond the defender’s control. Several methods have been proposed to counter indiscriminate poisoning attacks:

  • Paudice et al. (2018) addressed label-flipping attacks with a label-sanitization technique based on the k-Nearest Neighbours (kNN) algorithm: labels are reassigned across the training set, and if the proportion of a sample’s k nearest neighbours sharing the most frequent label exceeds a predefined threshold, the sample’s label is updated to that majority label (a minimal sketch follows this list).
  • Steinhardt et al. (2017) demonstrated that the distributional difference between poisoned and benign data makes outlier detection an effective defensive measure.
  • Clustering techniques have been used for indiscriminate poisoning (Chen et al., 2020; Zhao et al., 2021) and backdoor/targeted attacks (Gu et al., 2019), considering features and labels for improved detection.
  • Outlier detection methods have been applied to network latent features to identify backdoor and targeted poisoning attacks (Tran et al., 2018; Chen et al., 2017; Liu et al., 2020).
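The sketch below illustrates the kNN-based label sanitization described in the first bullet, using NumPy and scikit-learn. The values of k and the threshold, as well as the function name, are illustrative assumptions rather than the exact settings of Paudice et al. (2018).

```python
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

def knn_label_sanitization(X, y, k=10, threshold=0.6):
    """Relabel suspicious points using the majority label of their k nearest neighbours."""
    y = np.asarray(y)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, neigh_idx = nn.kneighbors(X)               # first neighbour is the point itself
    y_clean = y.copy()
    for i, idx in enumerate(neigh_idx):
        neighbour_labels = y[idx[1:]]             # exclude the point itself
        majority_label, count = Counter(neighbour_labels).most_common(1)[0]
        if count / k >= threshold:                # strong agreement among neighbours
            y_clean[i] = majority_label           # overwrite a potentially flipped label
    return y_clean
```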

Robust Training

An alternative strategy to counter poisoning attacks is to modify the training process itself. The key idea is to design training algorithms that minimize the influence of adversarial samples, thereby reducing the effectiveness of the attack. These methods require access to the training data and model parameters but do not need clean validation data. They are only applicable when the defender controls model training, such as when training from scratch or fine-tuning.

To mitigate indiscriminate poisoning attacks, one approach is to divide the training data into smaller subsets and train a separate classifier on each. The rationale is that an attacker would need far more poisoned samples to compromise all of the smaller classifiers. This can be achieved through ensemble methods such as bagging (Biggio et al., 2011; Levine & Feizi, 2021; Wang et al., 2022); a minimal sketch follows this paragraph. Another technique, proposed by Nelson et al. (2008), evaluates each email in the training dataset to determine whether it is a potential poisoning sample: the dataset is randomly split five times into a training set (including the email in question) and a validation set, a classifier is trained on each training set, and its performance is assessed on the corresponding validation set. If, on average across the five iterations, the classifier’s performance degrades when the email is included, the email is flagged as an attack.
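Below is a minimal sketch of the partition-and-ensemble idea, using scikit-learn’s logistic regression as the base learner. The base model, the number of partitions, and the function names are illustrative assumptions, not the exact constructions of the cited works.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_partition_ensemble(X, y, n_partitions=5, seed=0):
    """Train one model per disjoint partition of the training data."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))               # shuffle, then split into disjoint parts
    models = []
    for part in np.array_split(order, n_partitions):
        model = LogisticRegression(max_iter=1000).fit(X[part], y[part])
        models.append(model)
    return models

def predict_majority(models, X):
    """Predict by majority vote across the partition models (assumes integer class labels)."""
    votes = np.stack([m.predict(X) for m in models])   # shape: (n_models, n_samples)
    return np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)
```

Because a fixed budget of poisoned samples is spread across disjoint partitions, only a minority of the sub-models can be corrupted, and the majority vote remains largely unaffected.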

Data augmentation and gradient-based techniques can further mitigate backdoor and targeted poisoning attacks. Introducing noise into the training data has also proven effective against indiscriminate and backdoor attacks (see the sketch below).
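As a simple illustration of noise-based augmentation, the sketch below adds Gaussian-perturbed copies of the training data, which can dilute the precise patterns (such as backdoor triggers) that poisoned samples rely on. The noise scale, number of copies, and function name are illustrative assumptions.

```python
import numpy as np

def gaussian_noise_augment(X, y, sigma=0.1, copies=1, seed=0):
    """Augment the training set with Gaussian-noised copies of each sample."""
    rng = np.random.default_rng(seed)
    noisy = [X + rng.normal(0.0, sigma, size=X.shape) for _ in range(copies)]
    X_aug = np.concatenate([X] + noisy, axis=0)   # original plus perturbed copies
    y_aug = np.concatenate([y] * (copies + 1), axis=0)
    return X_aug, y_aug
```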

Finally, differential privacy has been applied to counter both indiscriminate and targeted poisoning attacks. Ma et al. (2019) proposed differential privacy (DP) as a defence, a property that follows directly from the DP definition: the trained model cannot depend too strongly on any individual training sample. However, it is well known that differentially private ML models achieve lower accuracy than standard models, so the trade-off between robustness and accuracy must be weighed in each application. If an application already has strong data privacy requirements and uses differentially private training, protection against targeted poisoning attacks comes as an additional benefit. However, the robustness offered by DP fades once a targeted attack uses multiple poisoning samples (as in subpopulation poisoning attacks), because the group privacy bound no longer provides meaningful guarantees for large poisoned sets.
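The following sketch shows DP-style training for a logistic-regression model: per-example gradients are clipped and Gaussian noise is added, which bounds the influence of any single (possibly poisoned) sample. It is a simplified illustration without formal privacy accounting, and all hyperparameters and names are illustrative assumptions rather than the method of Ma et al. (2019).

```python
import numpy as np

def dp_sgd_logistic(X, y, epochs=20, lr=0.1, clip=1.0, noise_mult=1.0, seed=0):
    """Logistic regression trained with per-example gradient clipping and Gaussian noise.

    y is assumed to contain binary labels in {0, 1}.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        preds = 1.0 / (1.0 + np.exp(-X @ w))           # sigmoid predictions
        per_example_grads = (preds - y)[:, None] * X   # log-loss gradient per sample
        # Clip each per-example gradient to L2 norm `clip` to bound individual influence.
        norms = np.linalg.norm(per_example_grads, axis=1, keepdims=True)
        clipped = per_example_grads * np.minimum(1.0, clip / np.maximum(norms, 1e-12))
        # Sum, add noise calibrated to the clipping norm, then average.
        noisy_grad = (clipped.sum(axis=0) + rng.normal(0.0, noise_mult * clip, size=d)) / n
        w -= lr * noisy_grad
    return w
```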

Model Inspection

Model inspection involves analyzing a model before deployment to determine if a backdoor has been implanted. This category of defences specifically targets backdoors and targeted attacks. Various techniques fall under model inspection, making it applicable across different learning settings, with some exceptions for specific methods.


Training Data Sanitization from “Machine learning security and privacy: a review of threats and countermeasures” by Anum Paracha, Junaid Arshad, Mohamed Ben Farah & Khalid Ismail is licensed under Attribution 4.0 International, except where otherwise noted. Modifications: rephrased.

Robust Training and model inspection from “Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations” by Apostol Vassilev, Alina Oprea, Alie Fordyce, Hyrum Anderson, National Institute of Standards and Technology – U.S. Department of Commerce. Republished courtesy of the National Institute of Standards and Technology. Modifications: rephrased.

License


Winning the Battle for Secure ML Copyright © 2025 by Bestan Maaroof is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.