"

4.5 Chapter Summary

Key Takeaways

  • Poisoning Attacks:
    • Data poisoning attacks manipulate training data to mislead the learning process of ML models, unlike adversarial attacks, which only affect test samples.
    • Attackers can modify features, flip labels, or alter model configurations to achieve their goals.
  • Threat Modeling and Attack Scenarios:
    • Three primary attack scenarios: Training-from-Scratch (TS), Fine-Tuning (FT), and Model Training by a third party (MT).
    • In TS and FT, attackers manipulate training samples but may lack full knowledge of the dataset or model.
    • In MT, attackers control the entire training process, posing a higher risk of model manipulation.
  • Attack Strategies:
    • Poisoning attacks can be indiscriminate (availability attacks that degrade overall model accuracy) or targeted (integrity attacks that cause specific samples to be misclassified).
    • Backdoor attacks implant a trigger pattern in the training data so that test samples containing the trigger are misclassified.
    • Attack methods include optimization-based perturbations (bilevel programming, feature collision) and direct sample manipulation.
  • Types of Poisoning Attacks:
    • Indiscriminate Poisoning: Aims to reduce model availability by degrading overall accuracy.
    • Label-Flip Poisoning: The simplest form, in which the labels of a subset of training samples are flipped (a minimal code sketch appears after this list).
    • Bilevel Poisoning: Manipulates both features and labels to maximize attack impact; a generic formulation appears after this list.
    • Targeted Poisoning: Compromises model integrity by causing misclassification of specific samples.
    • Feature Collision: Creates clean-label poisoned samples that collide with target samples in feature space (see the formulation after this list).
  • Mitigation Strategies:
    • Training Data Sanitization: Identifies and removes poisoned samples before training.
    • Robust Training: Modifies the training process to minimize the impact of adversarial samples.
    • Model Inspection: Detects backdoors or compromised models before deployment.
    • Model Sanitization: Cleans the model to eliminate backdoors or targeted poisoning attempts.
  • Defensive Techniques:
    • Label Correction: Uses k-Nearest Neighbors (kNN) to reassign labels and correct mislabeled samples (a defense sketch also appears after this list).
    • Outlier Detection: Identifies poisoned samples by detecting anomalies in the training data.
    • Ensemble Methods: Divide the training data into subsets and train a model on each, limiting the influence any poisoned subset can have on the final prediction.
    • Data Augmentation: Introduces noise or synthetic data to mitigate backdoor and targeted attacks.
    • Differential Privacy: Limits the influence of individual data points to reduce the effect of poisoned samples.

OpenAI. (2025). ChatGPT [Large language model]. https://chat.openai.com/chat
Prompt: Can you generate key takeaways for this chapter content?
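
The optimization-based attacks summarized above are usually written as explicit objectives. The two formulations below use generic notation that is standard in the poisoning literature, not symbols taken from this chapter: D_tr and D_val are the training and validation sets, D_p the poisoned points, θ the model parameters, L the training loss, and A the attacker's objective (validation loss for an availability attack, loss on chosen targets for an integrity attack).

```latex
% Bilevel poisoning: the outer level crafts the poisoned set D_p to maximize the
% attacker's objective; the inner level is the victim's ordinary training problem.
\max_{D_p} \; \mathcal{A}\left(D_{\mathrm{val}}, \theta^{\star}\right)
\quad \text{s.t.} \quad
\theta^{\star} \in \arg\min_{\theta} \; \mathcal{L}\left(D_{\mathrm{tr}} \cup D_p, \theta\right)

% Feature collision (clean-label targeted poisoning): f is a feature extractor,
% t the target sample, b a benign base sample, and beta balances the two terms.
p = \arg\min_{x} \; \lVert f(x) - f(t) \rVert_2^2 + \beta \, \lVert x - b \rVert_2^2
```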
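
As a minimal illustration of the label-flip attack referenced above, the sketch below flips a random fraction of training labels on a synthetic binary task and measures the resulting drop in test accuracy. The dataset, flip rates, and logistic-regression model are illustrative assumptions, not the chapter's experimental setup.

```python
# Minimal sketch: indiscriminate label-flip poisoning on a synthetic binary task.
# Dataset, flip rates, and model choice are illustrative assumptions.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

def flip_labels(labels, rate, rng):
    """Flip the binary labels of a random fraction `rate` of training samples."""
    poisoned = labels.copy()
    idx = rng.choice(len(labels), size=int(rate * len(labels)), replace=False)
    poisoned[idx] = 1 - poisoned[idx]  # binary labels: 0 <-> 1
    return poisoned

for rate in (0.0, 0.1, 0.3):
    y_poisoned = flip_labels(y_train, rate, rng)
    clf = LogisticRegression(max_iter=1000).fit(X_train, y_poisoned)
    print(f"flip rate {rate:.1f} -> test accuracy "
          f"{accuracy_score(y_test, clf.predict(X_test)):.3f}")
```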
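
On the defense side, the kNN label-correction idea can be sketched as follows: each training label is replaced by the majority label among its k nearest neighbors, on the assumption that flipped labels tend to disagree with their neighborhood. The value of k and the simple majority vote are illustrative choices rather than the chapter's exact algorithm.

```python
# Minimal sketch: kNN-based label correction (training-data sanitization).
# The value of k and the majority-vote rule are illustrative choices.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_label_correction(X, y, k=5):
    """Replace each label with the majority label among its k nearest neighbors."""
    # Query k + 1 neighbors because every point is its own nearest neighbor.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)
    neighbor_labels = y[idx[:, 1:]]  # drop each point's own label
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])

# Usage, continuing the label-flip sketch above: sanitize labels before training.
# y_corrected = knn_label_correction(X_train, y_poisoned, k=5)
# clf = LogisticRegression(max_iter=1000).fit(X_train, y_corrected)
```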

Key Terms

  • Backdoor Attacks: These involve embedding a specific pattern (the “trigger”) in training data such that, during inference, any input containing the trigger will be misclassified by the poisoned model (a minimal sketch follows this list).
  • Fine-Tuning (FT): A pre-trained model is refined on new data to adjust its classification function. If the pre-trained model or the new data comes from an untrusted source, hidden manipulations can be introduced.
  • Indiscriminate Poisoning Attacks: Manipulate training data to hurt the system’s availability, reducing the model’s prediction accuracy on test samples.
  • Input Feature Manipulation: Both the features and labels of training samples are modified. This grants attackers more flexibility but typically requires in-depth knowledge of the model and training data.
  • Label Modification: Only the labels associated with the training data are altered. This approach assumes that the adversary either knows or can estimate how label changes will influence model training and chooses labels to maximize harm.
  • Model Inspection: Analyzing a model before deployment to determine if a backdoor has been implanted.
  • Model Training by a Third Party (MT): Users with limited computing power outsource the training process to a third party while providing the training dataset.
  • Robust Training: Developing training algorithms that minimize the influence of adversarial samples, thereby reducing the effectiveness of the attack.
  • Targeted (Integrity) Poisoning Attacks: Compromise the integrity of the poisoned model by causing specific test samples to be misclassified.
  • Training Data Sanitization: Focuses on detecting and eliminating poisoned samples before training, thereby reducing the impact of adversarial attacks.
  • Training-from-Scratch (TS): The model is trained from scratch with randomly initialized weights. The attacker can add harmful (poisoned) data to mislead the training process.
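
To make the backdoor definition above concrete, the sketch below stamps a small trigger patch onto a fraction of training images and relabels them with the attacker's chosen class; at inference time the same stamp is applied to any clean input. The patch size and position, poison rate, and target class are illustrative assumptions.

```python
# Minimal sketch: embedding a backdoor trigger in image-like training data.
# Patch size/position, poison rate, and target class are illustrative assumptions.
import numpy as np

def add_trigger(images, patch_value=1.0, patch_size=3):
    """Stamp a small bright square into the bottom-right corner of each image."""
    stamped = images.copy()
    stamped[:, -patch_size:, -patch_size:] = patch_value
    return stamped

def poison_with_backdoor(X, y, target_class, poison_rate=0.05, seed=0):
    """Apply the trigger to a random fraction of samples and relabel them."""
    rng = np.random.default_rng(seed)
    X_p, y_p = X.copy(), y.copy()
    idx = rng.choice(len(X), size=int(poison_rate * len(X)), replace=False)
    X_p[idx] = add_trigger(X_p[idx])
    y_p[idx] = target_class  # dirty-label backdoor: trigger -> attacker's class
    return X_p, y_p

# Toy data shaped like grayscale images (N x 28 x 28), just to show the mechanics.
rng = np.random.default_rng(1)
X = rng.random((200, 28, 28))
y = rng.integers(0, 10, size=200)
X_poisoned, y_poisoned = poison_with_backdoor(X, y, target_class=7)

# At inference, stamping the same trigger on any clean input steers the
# poisoned model toward the target class.
x_triggered = add_trigger(X[:1])
```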

License

Winning the Battle for Secure ML Copyright © 2025 by Bestan Maaroof is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.