A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment
CoRR (2024)
Abstract
While Multimodal Large Language Models (MLLMs) have achieved significant
advances in visual understanding and reasoning, their potential to serve as
powerful, flexible, interpretable, and text-driven models for Image Quality
Assessment (IQA) remains largely unexplored. In this paper, we conduct a
comprehensive and systematic study of prompting MLLMs for IQA. Specifically, we
first investigate nine prompting systems for MLLMs as combinations of three
standardized testing procedures in psychophysics (i.e., the single-stimulus,
double-stimulus, and multiple-stimulus methods) and three popular prompting
strategies in natural language processing (i.e., standard, in-context, and
chain-of-thought prompting). We then present a difficult-sample selection
procedure, taking into account sample diversity and uncertainty, to further
challenge MLLMs equipped with their respective optimal prompting systems. We
assess three open-source and one closed-source MLLM on several visual
attributes of image quality (e.g., structural and textural distortions, color
differences, and geometric transformations) in both full-reference and
no-reference scenarios. Experimental results show that only the closed-source
GPT-4V provides a reasonable account of human perception of image quality, but
it is weak at discriminating fine-grained quality variations (e.g., color
differences) and at comparing the visual quality of multiple images, tasks that
humans can perform effortlessly.
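The nine prompting systems described above are simply the cross product of the three psychophysical testing procedures and the three prompting strategies. A minimal sketch of enumerating this 3×3 design (names taken from the abstract; the enumeration itself is illustrative, not the paper's code):

```python
from itertools import product

# Three standardized psychophysical testing procedures (per the abstract).
procedures = ["single-stimulus", "double-stimulus", "multiple-stimulus"]

# Three popular NLP prompting strategies (per the abstract).
strategies = ["standard", "in-context", "chain-of-thought"]

# The nine prompting systems are all procedure/strategy combinations.
prompting_systems = list(product(procedures, strategies))
assert len(prompting_systems) == 9

for procedure, strategy in prompting_systems:
    print(f"{procedure} method + {strategy} prompting")
```

Each combination would then be instantiated as a concrete prompt template and evaluated per MLLM to find its optimal prompting system.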