Beyond Accuracy: Investigating Error Types in GPT-4 Responses to USMLE Questions
arXiv (2024)
Abstract
GPT-4 demonstrates high accuracy in medical QA tasks, leading with an
accuracy of 86.70%, yet errors remain. Additionally, current works use GPT-4 to only predict the
correct option without providing any explanation and thus do not provide any
insight into the thinking process and reasoning used by GPT-4 or other LLMs.
Therefore, we introduce a new domain-specific error taxonomy derived from
collaboration with medical students. Our GPT-4 USMLE Error (G4UE) dataset
comprises 4153 correct and 919 incorrect GPT-4 responses to questions from the
United States Medical Licensing Examination (USMLE). These
responses are quite long (258 words on average), containing detailed
explanations from GPT-4 justifying the selected option. We then launch a
large-scale annotation study using the Potato annotation platform and recruit
44 medical experts through Prolific, a well-known crowdsourcing platform. We
annotated 300 of these 919 incorrect data points at a granular level,
assigning them to different classes and creating multi-label spans to identify the reasons behind
the error. In our annotated dataset, a substantial portion of GPT-4's incorrect
responses is categorized by annotators as a "Reasonable response by GPT-4."
This sheds light on the challenge of discerning explanations that may lead to
incorrect options, even among trained medical professionals. We also provide
medical concepts and medical semantic predications extracted using the SemRep
tool for every data point. We believe these resources will aid in evaluating the
ability of LLMs to answer complex medical questions. We make the resources
available at https://github.com/roysoumya/usmle-gpt4-error-taxonomy.