5.7 Chapter Summary
Key Takeaways
Backdoor attacks manipulate training data to embed hidden triggers, causing misclassification when the trigger is present during inference.
Attack Scenarios:
- Outsourced Training: A malicious trainer injects a backdoor.
- Transfer Learning: A pre-trained model contains a backdoor inherited by fine-tuned models.
- Federated Learning: Malicious participants submit poisoned updates.
Attack execution relies on the backdoored model maintaining high accuracy on clean samples while achieving a high success rate on poisoned (triggered) samples.
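A minimal sketch of these two metrics is shown below; the `predict` callable is a hypothetical stand-in for a trained classifier, and the triggered inputs are assumed to have been stamped with the attacker's trigger beforehand.

```python
# Minimal sketch of the two quantities a backdoored model must satisfy:
# clean accuracy (normal behaviour) and attack success rate (ASR,
# behaviour on triggered inputs).
import numpy as np

def clean_accuracy(predict, x_clean, y_true):
    """Fraction of clean samples classified correctly."""
    return float(np.mean(predict(x_clean) == y_true))

def attack_success_rate(predict, x_triggered, target_class):
    """Fraction of triggered samples classified as the attacker's target class."""
    return float(np.mean(predict(x_triggered) == target_class))
```

A stealthy backdoor keeps clean accuracy close to that of a benign model while pushing the attack success rate toward 100%.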
Backdoor attacks range from simple patch triggers to more complex functional and semantic triggers.
Types of Triggers:
- Patch Triggers: Visible patches (e.g., stickers on traffic signs); a minimal poisoning sketch follows this list.
- Clean-Label Attacks: Labels remain unchanged, making detection harder.
- Semantic Triggers: Natural-looking modifications (e.g., glasses on faces).
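As a concrete illustration of a patch trigger combined with label manipulation, the sketch below stamps a small white patch onto a fraction of the training images and relabels them to the target class. The array layout, patch size, and 5% poisoning rate are illustrative assumptions, not values from the chapter.

```python
# Minimal patch-trigger poisoning sketch, assuming images as a NumPy array
# of shape (N, H, W, C) with pixel values in [0, 1].
import numpy as np

def poison_with_patch(images, labels, target_class, poison_rate=0.05,
                      patch_size=4, seed=0):
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(len(images) * poison_rate)
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Trigger embedding: stamp a white patch in the bottom-right corner.
    images[idx, -patch_size:, -patch_size:, :] = 1.0
    # Label manipulation: relabel the poisoned samples as the target class.
    labels[idx] = target_class
    return images, labels
```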
Mitigation Strategies:
- Data Sanitization: Detecting and removing poisoned samples.
- Trigger Reconstruction: Identifying triggers via optimization (e.g., Neural Cleanse); a reconstruction sketch follows this list.
- Model Inspection & Sanitization: Pruning suspicious neurons or fine-tuning.
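The sketch below gives the flavour of optimization-based trigger reconstruction in the spirit of Neural Cleanse: for one candidate target class, a mask and pattern are optimized so that clean images are pushed into that class while the mask stays small. It assumes a trained PyTorch classifier `model` and a DataLoader `loader` of clean images; the image shape and hyperparameters are illustrative.

```python
# Sketch of optimisation-based trigger reconstruction for one candidate class.
import torch
import torch.nn.functional as F

def reconstruct_trigger(model, loader, target_class, image_shape=(3, 32, 32),
                        epochs=5, lam=0.01, lr=0.1, device="cpu"):
    # Optimise unconstrained tensors; sigmoid keeps mask and pattern in [0, 1].
    mask = torch.zeros(1, *image_shape[1:], device=device, requires_grad=True)
    pattern = torch.zeros(image_shape, device=device, requires_grad=True)
    opt = torch.optim.Adam([mask, pattern], lr=lr)
    model.eval()
    for _ in range(epochs):
        for x, _ in loader:
            x = x.to(device)
            m, p = torch.sigmoid(mask), torch.sigmoid(pattern)
            x_trig = (1 - m) * x + m * p            # blend trigger into clean images
            target = torch.full((x.size(0),), target_class, device=device)
            # Force the target class while penalising large masks (L1 norm).
            loss = F.cross_entropy(model(x_trig), target) + lam * m.abs().sum()
            opt.zero_grad()
            loss.backward()
            opt.step()
    return torch.sigmoid(mask).detach(), torch.sigmoid(pattern).detach()
```

Running this for every candidate class and flagging classes whose recovered mask is anomalously small is the core detection idea.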
Challenges:
- Stealthy attacks (low poisoning rates, dynamic triggers) evade detection.
- Federated learning requires specialized defences due to decentralized threats; an aggregation sketch follows this list.
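One family of specialized defences clips each client update's norm and adds Gaussian noise before aggregation (see the Gaussian noise key term below). The sketch assumes flattened NumPy update vectors; the clipping threshold and noise level are illustrative.

```python
# Minimal sketch of a specialised federated defence: clip each client update
# to a maximum L2 norm so no single participant dominates the aggregate, then
# add Gaussian noise to dampen any residual backdoor signal.
import numpy as np

def clipped_noisy_aggregate(updates, clip_norm=1.0, noise_std=0.01, seed=0):
    rng = np.random.default_rng(seed)
    clipped = [u * min(1.0, clip_norm / (np.linalg.norm(u) + 1e-12)) for u in updates]
    aggregate = np.mean(clipped, axis=0)
    return aggregate + rng.normal(0.0, noise_std, size=aggregate.shape)
```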
OpenAI. (2025). ChatGPT [Large language model]. https://chat.openai.com/chat
Prompt: Can you generate key takeaways for this chapter content?
Key Terms
- Attack Execution: The inference-time stage of the attack, in which any input containing the trigger is misclassified as the target class.
- Backdoor Attack Federated Filter Evaluation (BaFFLe): A validation-based defence protocol that adds an extra evaluation phase to each round of federated training, using global models trained in earlier rounds as a reference so that significant deviations in the new global model can be detected.
- Clean-label Backdoors: The attacker does not change the labels of the poisoned samples, making the attack stealthier.
- Dynamic Backdoors: The trigger’s location or appearance varies across different samples, making it harder to detect.
- Functional Triggers: The trigger is embedded throughout the input or changes based on the input.
- Gaussian noise: Deliberate addition of random noise drawn from a Gaussian (normal) distribution to the model updates.
- Identifying and Down-weighting Malicious Updates: These algorithms focus on detecting and diminishing the influence of malicious client updates during aggregation.
- Label Manipulation: The labels of the poisoned samples are changed to the target class. (In some attacks the labels are left unchanged and a feature-collision strategy is used instead.)
- Model Training: The model is trained on the poisoned dataset, learning to associate the trigger with the target class.
- Patch Trigger: The trigger is a small patch added to the input data. For example, a sticker on a stop sign could cause an autonomous vehicle to misclassify it.
- Resistant Aggregation Without Malicious Client Identification: These methods do not attempt to identify malicious clients; instead, they rely on aggregation rules that remain robust when a minority of updates is poisoned (a minimal sketch follows this list).
- Semantic Triggers: Triggers formed from natural, physically perceptible features of the input (e.g., a particular pair of glasses), which makes them plausible and inconspicuous.
- Trigger Embedding: The attacker selects a trigger (e.g., a small patch, a specific pattern, or a noise pattern) and embeds it into a subset of the training data.
- Trigger Reconstruction: A mitigation strategy that focuses on identifying and reconstructing the backdoor trigger.
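As a minimal illustration of resistant aggregation without malicious client identification (see the key term above), the sketch below combines client updates with a coordinate-wise median, which bounds the influence of a minority of poisoned updates without ever labelling a client as malicious.

```python
# Coordinate-wise median aggregation: no client is identified as malicious,
# but a minority of extreme (poisoned) updates cannot drag the aggregate far.
import numpy as np

def median_aggregate(updates):
    """`updates` is a 2-D array-like of shape (num_clients, num_parameters)."""
    return np.median(np.asarray(updates), axis=0)

# Example: three honest clients and one poisoned client.
honest = [np.array([0.10, -0.20]), np.array([0.12, -0.18]), np.array([0.09, -0.21])]
poisoned = np.array([5.0, 5.0])
print(median_aggregate(honest + [poisoned]))   # stays close to the honest updates
```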