"

4.3 Attack Method and Examples

A poisoning attack involves an adversary deliberately tampering with training data to manipulate the behaviour of a machine learning model. These attacks typically fall into one of three categories: indiscriminate, targeted, and backdoor attacks. This chapter covers the first two types of attacks, which involve only changes to the training data. In the next chapter, we will look at backdoor attacks, which rely on inserting a trigger into both training and test data, leading to targeted misclassifications when the trigger is present.

Forms of Data Tampering in Poisoning Attacks

Attackers may exploit one or both of the following:

  • Label Modification: Only the labels associated with the training data are altered. This approach assumes that the adversary either knows or can estimate how label changes will influence model training and chooses labels to maximize harm.
  • Input Feature Manipulation: Both the features and labels of training samples are modified. This grants attackers more flexibility but typically requires in-depth knowledge of the model and training data.
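To make the distinction concrete, here is a minimal sketch in Python; the toy dataset, the five poisoned indices, and the perturbation vector are arbitrary illustrative choices, not details taken from the sources cited at the end of this section.

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))               # training features
    y = (X[:, 0] + X[:, 1] > 0).astype(int)     # training labels

    # Label modification: the features are untouched, and only a chosen
    # subset of labels is changed.
    poison_idx = rng.choice(len(y), size=5, replace=False)
    y_label_poisoned = y.copy()
    y_label_poisoned[poison_idx] = 1 - y_label_poisoned[poison_idx]

    # Input feature manipulation: the features themselves are perturbed
    # (here by an arbitrary shift); the labels may also be changed.
    X_feature_poisoned = X.copy()
    X_feature_poisoned[poison_idx] += np.array([1.5, -1.5])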

Categories of Poisoning Attacks

1. Indiscriminate (Availability) Attacks

These attacks aim to degrade the model’s performance across a wide range of inputs. By injecting corrupted data, the attacker causes the model to generalize poorly, which can result in denial of service or operational failure.

2. Targeted Attacks

Unlike indiscriminate attacks, targeted poisoning aims to make the model misclassify specific inputs while retaining high overall accuracy. These attacks are subtle and are especially effective when a model is fine-tuned or trained from scratch on the poisoned data.

3. Backdoor Attacks

These involve embedding a specific pattern (the “trigger”) in training data such that, during inference, any input containing the trigger will be misclassified by the poisoned model. This type of attack will be discussed in depth in the next chapter.

The conceptual effects of these attacks are often illustrated via decision boundaries. In a clean model, the decision surface is formed solely based on genuine data. Poisoning attacks distort this surface to suit malicious goals, often subtly enough to evade immediate detection.

Key Poisoning Techniques

1. Label Flipping Attacks

These attacks flip the class labels of selected training samples without altering the features. Because the feature values are untouched, the poisoned samples look legitimate on casual inspection. The resulting misalignment between inputs and labels causes the model to internalize incorrect associations, reducing accuracy or causing specific misclassifications.

Example

In a binary classifier, flipping several “positive” training examples to “negative” may cause the model to shrink or shift its decision boundary, degrading its predictions on genuinely positive inputs.
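The following sketch illustrates this kind of attack on synthetic data; the dataset, the 20% flip rate, and the logistic regression model are illustrative assumptions rather than settings taken from the cited papers.

    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split

    # Synthetic binary classification task (illustrative).
    X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # The attacker flips the labels of 20% of the "positive" training samples.
    rng = np.random.default_rng(0)
    pos_idx = np.where(y_train == 1)[0]
    flip_idx = rng.choice(pos_idx, size=int(0.2 * len(pos_idx)), replace=False)
    y_poisoned = y_train.copy()
    y_poisoned[flip_idx] = 0

    # Train one model on clean labels and one on the flipped labels.
    clean_model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
    poisoned_model = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)

    print("clean test accuracy:   ", clean_model.score(X_test, y_test))
    print("poisoned test accuracy:", poisoned_model.score(X_test, y_test))

Even though every feature vector is untouched, the mislabelled positives pull the learned boundary toward the positive region, which typically shows up as a drop in test accuracy and, in particular, in recall for the positive class.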

2. Feature-Space Attacks

Here, the attacker alters the feature vectors of training samples so that they resemble a specific target input, causing the model to misclassify that target at test time. Such attacks are most often applied when a model is fine-tuned rather than trained from scratch, and they are hard to detect because they:

  • Don’t involve label changes.
  • Introduce minimal perturbations.
  • Affect only the classification of the target sample, leaving overall accuracy largely intact.
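One common way to realize such an attack is a feature-collision style poison: a clean sample from the attacker’s desired class is perturbed until its representation under a frozen feature extractor matches that of the target, while staying close to the original input. The sketch below uses a fixed linear map as a stand-in for the feature extractor; the matrix W, the weight beta, the step size, and the iteration count are illustrative assumptions, not values from the cited papers.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.normal(size=(16, 8))     # stand-in for a frozen feature extractor
    x_target = rng.normal(size=8)    # test input the attacker wants misclassified
    x_base = rng.normal(size=8)      # clean sample from the attacker's desired class
    beta, lr = 0.25, 0.01            # closeness weight and step size (arbitrary)

    # The poison keeps the base sample's clean label, but its features are
    # nudged until they "collide" with the target's features.
    x_poison = x_base.copy()
    for _ in range(500):
        # Gradient of ||W x_p - W x_t||^2 + beta * ||x_p - x_b||^2
        grad = 2 * W.T @ (W @ x_poison - W @ x_target) + 2 * beta * (x_poison - x_base)
        x_poison -= lr * grad

    print("feature distance to target:", np.linalg.norm(W @ x_poison - W @ x_target))
    print("input distance to base:    ", np.linalg.norm(x_poison - x_base))

Inserting the optimized point into the fine-tuning set with its unchanged label can pull a model that relies on these features into misclassifying the target, while the poison itself still looks like an ordinary member of its class.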

3. Bilevel Optimization Attacks

This sophisticated technique frames the attack as a nested optimization problem:

  • The inner loop trains the model on data that includes poisoned samples.
  • The outer loop adjusts the poisoned data to maximize the attack’s impact.

This approach allows adversaries to fine-tune the influence of poisoned samples by optimizing their contribution to the model’s overall loss or classification outcomes.
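A minimal sketch of this nested structure, assuming PyTorch is available, is shown below. It approximates the inner problem with a short unrolled gradient-descent loop and differentiates through it to update the poison points; the toy data, model, poison labels, step counts, and learning rates are all illustrative assumptions rather than settings from the cited work.

    import torch
    import torch.nn.functional as F

    torch.manual_seed(0)

    # Toy clean training data, a held-out validation set, and a few poison points.
    X_clean = torch.randn(200, 2)
    y_clean = (X_clean.sum(dim=1) > 0).float()
    X_val = torch.randn(200, 2)
    y_val = (X_val.sum(dim=1) > 0).float()
    X_poison = torch.randn(10, 2, requires_grad=True)  # outer variables (poison features)
    y_poison = torch.zeros(10)                         # attacker-chosen poison labels

    def inner_train(X_p):
        # Inner loop: train a logistic model on clean + poisoned data (unrolled).
        w = torch.zeros(2, requires_grad=True)
        b = torch.zeros(1, requires_grad=True)
        X = torch.cat([X_clean, X_p])
        y = torch.cat([y_clean, y_poison])
        for _ in range(50):
            loss = F.binary_cross_entropy_with_logits(X @ w + b, y)
            gw, gb = torch.autograd.grad(loss, (w, b), create_graph=True)
            w, b = w - 0.5 * gw, b - 0.5 * gb
        return w, b

    outer_opt = torch.optim.Adam([X_poison], lr=0.1)
    for step in range(30):
        w, b = inner_train(X_poison)
        # Outer objective: degrade the trained model's performance on held-out data.
        val_loss = F.binary_cross_entropy_with_logits(X_val @ w + b, y_val)
        outer_opt.zero_grad()
        (-val_loss).backward()   # ascend on the validation loss
        outer_opt.step()
        if step % 10 == 0:
            print(f"outer step {step:2d}  validation loss {val_loss.item():.3f}")

Because the true inner problem is a full training run, research implementations typically replace naive unrolling with approximations such as back-gradient optimization, influence functions, or reformulations based on the inner problem’s optimality conditions, and a practical attack would also constrain the poison points so they remain plausible inputs.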


This section is based on information from the following sources:

“Exploring the Limits of Model-Targeted Indiscriminate Data Poisoning Attacks” by Yiwei Lu, Gautam Kamath, and Yaoliang Yu is licensed under a Creative Commons Attribution 4.0 International Licence.

“Robustness of Selected Learning Models under Label-Flipping Attack” by Sarvagya Bhargava and Mark Stamp is licensed under a Creative Commons Attribution 4.0 International Licence.

“Indiscriminate Data Poisoning Attacks on Pre-trained Feature Extractors” by Yiwei Lu, Matthew Y.R. Yang, Gautam Kamath, and Yaoliang Yu is licensed under a Creative Commons Attribution 4.0 International Licence.

“Hyperparameter Learning Under Data Poisoning: Analysis of the Influence of Regularization via Multiobjective Bilevel Optimization” by Javier Carnerero-Cano, Luis Muñoz-González, Phillippa Spencer, and Emil C. Lupu is licensed under a Creative Commons Attribution 4.0 International Licence.

License


Winning the Battle for Secure ML Copyright © 2025 by Bestan Maaroof is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.