"

5.7 Chapter Summary

Key Takeaways

Backdoor attacks manipulate training data to embed hidden triggers, causing misclassification when the trigger is present during inference.

Attack Scenarios:

  • Outsourced Training: A malicious trainer injects a backdoor.
  • Transfer Learning: A pre-trained model contains a backdoor inherited by fine-tuned models.
  • Federated Learning: Malicious participants submit poisoned updates.

Successful attack execution relies on the backdoored model retaining high accuracy on clean samples while achieving a high attack success rate on samples that contain the trigger.
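These two quantities can be measured directly. The minimal sketch below (using NumPy, and assuming a hypothetical classifier with a `predict` method and a hypothetical `apply_trigger` function that are not defined in this chapter) computes clean accuracy and attack success rate.

```python
import numpy as np

def clean_accuracy(model, x_clean, y_true):
    """Fraction of unmodified inputs that are classified correctly."""
    preds = model.predict(x_clean)
    return np.mean(preds == y_true)

def attack_success_rate(model, x_clean, y_true, apply_trigger, target_class):
    """Fraction of triggered inputs (drawn from non-target classes) that the
    model misclassifies as the attacker's target class."""
    mask = y_true != target_class            # skip inputs already in the target class
    x_triggered = apply_trigger(x_clean[mask])
    preds = model.predict(x_triggered)
    return np.mean(preds == target_class)
```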

Types of backdoor attacks vary from simple patch triggers to complex functional and semantic triggers.

Types of Triggers:

  • Patch Triggers: Visible patches (e.g., stickers on traffic signs); a concrete example is sketched after this list.
  • Clean-Label Attacks: Labels remain unchanged, making detection harder.
  • Semantic Triggers: Natural-looking modifications (e.g., glasses on faces).
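As a concrete illustration of a patch trigger combined with label manipulation, the sketch below stamps a small square into the corner of a random fraction of training images and relabels them to the attacker's target class. The poisoning rate, patch size, and toy dataset are illustrative assumptions rather than values from the chapter.

```python
import numpy as np

def poison_with_patch(images, labels, target_class, poison_rate=0.05,
                      patch_size=3, patch_value=1.0, seed=0):
    """Embed a small square patch (the trigger) into a random subset of images
    and flip their labels to the target class (label manipulation)."""
    rng = np.random.default_rng(seed)
    images, labels = images.copy(), labels.copy()
    n_poison = int(poison_rate * len(images))
    idx = rng.choice(len(images), size=n_poison, replace=False)
    # Stamp the trigger into the bottom-right corner of each chosen image.
    images[idx, -patch_size:, -patch_size:] = patch_value
    labels[idx] = target_class
    return images, labels, idx

# Toy example: 1,000 grayscale 28x28 images with 10 classes.
x = np.random.rand(1000, 28, 28).astype(np.float32)
y = np.random.randint(0, 10, size=1000)
x_poisoned, y_poisoned, poisoned_idx = poison_with_patch(x, y, target_class=7)
```

Training a model on `x_poisoned` and `y_poisoned` teaches it to associate the patch with class 7 while behaving normally on clean inputs.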

Mitigation Strategies:

  • Data Sanitization: Detecting and removing poisoned samples.
  • Trigger Reconstruction: Identifying triggers via optimization (e.g., Neural Cleanse); see the sketch after this list.
  • Model Inspection & Sanitization: Pruning suspicious neurons or fine-tuning.
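The trigger-reconstruction idea can be framed as an optimization problem in the spirit of Neural Cleanse: for a candidate target class, learn a mask and pattern that push clean inputs toward that class while keeping the mask as small as possible; an unusually small reconstructed mask suggests a real backdoor. The PyTorch sketch below assumes a hypothetical `model` and `clean_loader`; it illustrates the idea rather than reproducing the published algorithm.

```python
import torch

def reconstruct_trigger(model, clean_loader, target_class, image_shape,
                        steps=500, lr=0.1, l1_weight=0.01, device="cpu"):
    """Optimize a mask and pattern so that patched inputs are classified as
    target_class; the L1 penalty keeps the candidate trigger small."""
    mask_logit = torch.zeros(image_shape[-2:], requires_grad=True, device=device)
    pattern = torch.zeros(image_shape, requires_grad=True, device=device)
    opt = torch.optim.Adam([mask_logit, pattern], lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.eval()

    data_iter = iter(clean_loader)
    for _ in range(steps):
        try:
            x, _ = next(data_iter)
        except StopIteration:
            data_iter = iter(clean_loader)
            x, _ = next(data_iter)
        x = x.to(device)
        mask = torch.sigmoid(mask_logit)                 # values in (0, 1)
        patched = (1 - mask) * x + mask * torch.sigmoid(pattern)
        logits = model(patched)
        target = torch.full((x.size(0),), target_class,
                            dtype=torch.long, device=device)
        # Classification loss toward the target class plus an L1 penalty
        # that favours the smallest mask able to cause the misclassification.
        loss = loss_fn(logits, target) + l1_weight * mask.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

    return torch.sigmoid(mask_logit).detach(), torch.sigmoid(pattern).detach()
```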

Challenges:

  • Stealthy attacks (low poisoning rates, dynamic triggers) evade detection.
  • Federated learning requires specialized defences due to decentralized threats.

OpenAI. (2025). ChatGPT. [Large language model]. https://chat.openai.com/chat
Prompt: Can you generate key takeaways for this chapter content?

Key Terms

  • Attack Execution: Any input containing the trigger will be misclassified as the target class during inference.
  • Backdoor Attack Federated Filter Evaluation (BaFFLe): A validation-based defence protocol that adds an extra evaluation phase to each round of federated training, using global models trained in earlier rounds as a reference for the current round so that any major change in the model's behaviour can be detected.
  • Clean-label Backdoors: The attacker does not change the labels of the poisoned samples, making the attack stealthier.
  • Dynamic Backdoors: The trigger’s location or appearance varies across different samples, making it harder to detect.
  • Functional Triggers: The trigger is embedded throughout the input or changes based on the input.
  • Gaussian Noise: The deliberate addition of random noise drawn from a Gaussian (normal) distribution to the model updates (see the sketch after this list).
  • Identifying and Down-weighting Malicious Updates: These algorithms focus on detecting and diminishing the influence of malicious client updates during aggregation.
  • Label Manipulation: The labels of the poisoned samples are changed to the target class. (Sometimes, the label could be unchanged; instead, a feature collision strategy is used.)
  • Model Training: The model is trained on the poisoned dataset, learning to associate the trigger with the target class.
  • Patch Trigger: The trigger is a small patch added to the input data. For example, a sticker on a stop sign could cause an autonomous vehicle to misclassify it.
  • Resistant Aggregation Without Malicious Client Identification: Robust aggregation rules that bound the influence of any individual update rather than attempting to identify which clients are malicious.
  • Semantic Triggers: Physically perceptible, natural-looking triggers (e.g., a particular pair of glasses on a face), which makes the triggered input look plausible.
  • Trigger Embedding: The attacker selects a trigger (e.g., a small patch, a specific pattern, or a noise pattern) and embeds it into a subset of the training data.
  • Trigger Reconstruction: A mitigation strategy that focuses on identifying and reconstructing the backdoor trigger.
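To make the norm-clipping and Gaussian-noise defence for federated learning concrete, the sketch below bounds each client's update, averages the clipped updates, and perturbs the result with Gaussian noise before it is applied to the global model. The clipping bound and noise scale are illustrative assumptions, not values from the chapter.

```python
import numpy as np

def aggregate_with_clipping_and_noise(client_updates, clip_norm=1.0,
                                      noise_std=0.01, seed=0):
    """Server-side aggregation that limits how much any single client update
    can shift the global model and adds Gaussian noise, which further weakens
    embedded backdoors."""
    rng = np.random.default_rng(seed)
    clipped = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        scale = min(1.0, clip_norm / (norm + 1e-12))   # shrink oversized updates
        clipped.append(update * scale)
    aggregate = np.mean(clipped, axis=0)
    return aggregate + rng.normal(0.0, noise_std, size=aggregate.shape)

# Example: three flattened client updates, one much larger (potentially malicious).
updates = [np.random.randn(10) * s for s in (0.1, 0.1, 5.0)]
new_global_delta = aggregate_with_clipping_and_noise(updates)
```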

License


Winning the Battle for Secure ML Copyright © 2025 by Bestan Maaroof is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License, except where otherwise noted.