Adversarial Text Purification: A Large Language Model Approach for Defense
CoRR (2024)
Abstract
Adversarial purification is a defense mechanism for safeguarding classifiers
against adversarial attacks without knowing the type of attacks or training of
the classifier. These techniques characterize and eliminate adversarial
perturbations from the attacked inputs, aiming to restore purified samples that
retain similarity to the initially attacked ones and are correctly classified
by the classifier. Due to the inherent challenges associated with
characterizing noise perturbations for discrete inputs, adversarial text
purification has been relatively unexplored. In this paper, we investigate the
effectiveness of adversarial purification methods in defending text
classifiers. We propose a novel adversarial text purification method that harnesses
the generative capabilities of Large Language Models (LLMs) to purify
adversarial text without the need to explicitly characterize the discrete noise
perturbations. We utilize prompt engineering to exploit LLMs for recovering the
purified examples for given adversarial examples such that they are
semantically similar and correctly classified. Our proposed method demonstrates
remarkable performance over various classifiers, improving their accuracy under
attack by over 65%.
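The purification step described above can be sketched as a prompt-construction routine. This is a hedged illustration only: the prompt wording, the `build_purification_prompt` and `purify` names, and the generic `llm` callable are assumptions for exposition, not the paper's actual implementation.

```python
def build_purification_prompt(adversarial_text: str) -> str:
    """Build a prompt asking an LLM to rewrite a possibly-attacked input
    into a semantically equivalent, clean version (hypothetical wording)."""
    return (
        "The following sentence may contain small adversarial word "
        "substitutions or character-level typos. Rewrite it so that it is "
        "fluent and preserves the original meaning, correcting any "
        "suspicious words.\n\n"
        f"Sentence: {adversarial_text}\n"
        "Purified sentence:"
    )


def purify(adversarial_text: str, llm) -> str:
    """Purify one example; `llm` is any callable mapping a prompt string
    to a completion string (e.g. a wrapper around an LLM API)."""
    return llm(build_purification_prompt(adversarial_text)).strip()
```

The purified output would then be fed to the downstream classifier in place of the attacked input; no gradient access to, or retraining of, the classifier is required.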