LLMs Cannot Reliably Identify and Reason About Security Vulnerabilities (Yet?): A Comprehensive Evaluation, Framework, and Benchmarks
arXiv (2023)
Abstract
Large Language Models (LLMs) have been suggested for use in automated
vulnerability repair, but benchmarks showing they can consistently identify
security-related bugs are lacking. We thus develop SecLLMHolmes, a fully
automated evaluation framework that performs the most detailed investigation to
date on whether LLMs can reliably identify and reason about security-related
bugs. We construct a set of 228 code scenarios and analyze eight of the most
capable LLMs across eight different investigative dimensions using our
framework. Our evaluation shows that LLMs provide non-deterministic responses
and incorrect, unfaithful reasoning, and that they perform poorly in real-world
scenarios. Most importantly, our findings reveal significant non-robustness in
even the most advanced models such as 'PaLM2' and 'GPT-4': by merely changing
function or variable names, or by adding library functions to the source code,
these models can yield incorrect answers in 26% of cases.
These findings demonstrate that further LLM advances are needed before LLMs can
be used as general-purpose security assistants.
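
To make the robustness failure concrete, below is a minimal sketch (not the paper's SecLLMHolmes implementation) of the kind of semantics-preserving perturbation the abstract describes: renaming an identifier in a vulnerable code scenario, which should not change a reliable assistant's verdict. The scenario string, the rename_identifier helper, and the ask_llm stand-in are all illustrative assumptions, not artifacts from the paper.

```python
import re

# Hypothetical scenario: a classic CWE-787 out-of-bounds write in C,
# expressed as a string so it can be perturbed and sent to a model.
SCENARIO = """
void copy_input(char *src) {
    char buf[16];
    strcpy(buf, src);   /* no bounds check: out-of-bounds write */
}
"""

def rename_identifier(code: str, old: str, new: str) -> str:
    """Rename one identifier, matching whole words only, so the
    program's semantics are unchanged by the rewrite."""
    return re.sub(rf"\b{re.escape(old)}\b", new, code)

# The perturbed scenario is semantically identical to the original;
# only a variable name differs (here, a misleadingly safe-sounding one).
perturbed = rename_identifier(SCENARIO, "buf", "safe_buffer")

# A robust assistant should answer identically for both variants.
# `ask_llm` is a placeholder for whatever model API is under test:
#     assert ask_llm(SCENARIO) == ask_llm(perturbed)
print(perturbed)
```

Under the paper's findings, even frontier models flip their vulnerability verdict on trivial rewrites like this one, which is why such perturbations serve as a robustness test.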