Enhancing Multimodal Large Language Models with Vision Detection Models: An Empirical Study
CoRR (2024)
Abstract
Despite the impressive capabilities of Multimodal Large Language Models
(MLLMs) in integrating text and image modalities, challenges remain in
accurately interpreting detailed visual elements. This paper presents an
empirical study on enhancing MLLMs with state-of-the-art (SOTA) object
detection and Optical Character Recognition models to improve fine-grained
image understanding and reduce hallucination in responses. Our research
investigates the embedding-based infusion of detection information, the impact
of such infusion on the MLLMs' original abilities, and the interchangeability
of detection models. We conduct systematic experiments with models such as
LLaVA-1.5, DINO, and PaddleOCRv2, revealing that our approach not only refines
MLLMs' performance in specific visual tasks but also maintains their original
strengths. The resulting enhanced MLLMs outperform SOTA models on 9 out of 10
benchmarks, achieving a score improvement of up to 12.99 and marking a notable
advancement in multimodal understanding. We release
our codes to facilitate further exploration into the fine-grained multimodal
dialogue capabilities of MLLMs.
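As a rough illustration of how detection and OCR outputs can be infused into an MLLM prompt, the sketch below serializes detector results into a structured text segment that could precede the user's question. This is a hypothetical minimal example, not the paper's actual embedding-based infusion; the function name, input format, and output layout are all assumptions for illustration.

```python
def format_detections(detections, ocr_results):
    """Serialize object-detection and OCR outputs into a text segment
    suitable for prepending to an MLLM prompt.

    detections:  list of (label, (x1, y1, x2, y2), confidence) tuples
    ocr_results: list of (text, box) tuples
    (Hypothetical formats -- the paper's embedding-based infusion
    operates on model embeddings rather than serialized text.)
    """
    lines = ["Detected objects:"]
    for label, (x1, y1, x2, y2), conf in detections:
        lines.append(f"- {label} at [{x1},{y1},{x2},{y2}] (conf {conf:.2f})")
    if ocr_results:
        lines.append("Recognized text:")
        for text, _box in ocr_results:
            lines.append(f'- "{text}"')
    return "\n".join(lines)


# Example: one detected object and one OCR result.
prompt_segment = format_detections(
    [("dog", (10, 20, 110, 220), 0.93)],
    [("STOP", (5, 5, 40, 20))],
)
print(prompt_segment)
```

In practice, an embedding-based variant would project the detector's features into the language model's embedding space and concatenate them with the visual tokens, rather than passing text; the textual form above is simply the easiest way to see what information (labels, boxes, recognized strings) the infusion carries.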