A Comprehensive Study of Jailbreak Attack versus Defense for Large Language Models
arXiv (2024)
Abstract
Large Language Models (LLMs) have increasingly become central to generating
content with potential societal impacts. Notably, these models have
demonstrated capabilities for generating content that could be deemed harmful.
To mitigate these risks, researchers have adopted safety training techniques to
align model outputs with societal values to curb the generation of malicious
content. However, the phenomenon of "jailbreaking", where carefully crafted
prompts elicit harmful responses from models, persists as a significant
challenge. This research conducts a comprehensive analysis of existing studies
on jailbreaking LLMs and their defense techniques. We meticulously investigate
nine attack techniques and seven defense techniques applied across three
distinct language models: Vicuna, Llama, and GPT-3.5 Turbo. We aim to evaluate
the effectiveness of these attack and defense techniques. Our findings reveal
that existing white-box attacks underperform compared to universal techniques
and that including special tokens in the input significantly affects the
likelihood of successful attacks. This research underscores the need to
focus on the security aspects of LLMs. Additionally, we contribute to the
field by releasing our datasets and testing framework, aiming to foster further
research into LLM security. We believe these contributions will facilitate the
exploration of security measures within this domain.