Towards Dynamic and Realistic Evaluation of Multi-modal Large Language Models

Hallucinations in the multimodal domain occur when a model produces information that contradicts the content of an image. Existing benchmarks for evaluating hallucinations in Multi-modal Large Language Models (MLLMs) typically adopt a static captioning or question-answering format, which deviates from the realistic and practical use of MLLMs. To address these limitations, we propose to evaluate MLLMs in a conversational fashion, designed to reflect real-world application scenarios. In our approach, an evaluator engages in a casual conversation with the model, generating diverse, contextually relevant questions based on both the image and the conversation history. The framework features a persona module and a question generation module, enabling the evaluator to mimic human-like questioning while ensuring an appropriate level of difficulty. Comparison with single-turn question-answering and captioning evaluation methods demonstrates that our approach elicits more hallucinations in MLLMs and covers more aspects of the visual content.
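
To make the described framework concrete, the following is a minimal sketch of the multi-turn evaluation loop with a persona module and a question generation module. It is an illustration only: the names (Persona, QuestionGenerator, conversational_eval) and the stubbed question-generation logic are hypothetical assumptions, not the paper's actual implementation, which would prompt an evaluator model and a target MLLM rather than use placeholders.

```python
from dataclasses import dataclass


@dataclass
class Turn:
    """One question-answer exchange in the evaluation conversation."""
    question: str
    answer: str


@dataclass
class Persona:
    """Hypothetical persona used to style the evaluator's questions."""
    name: str
    style: str  # e.g. "curious child", "detail-oriented reviewer"


class QuestionGenerator:
    """Hypothetical question-generation module: proposes the next question
    from the image content, the persona, and the conversation history."""

    def next_question(self, image_caption: str, persona: Persona,
                      history: list[Turn]) -> str:
        # Placeholder logic; a real system would query an evaluator LLM here.
        return (f"As a {persona.style}, ask question {len(history) + 1} "
                f"about the image: '{image_caption}'.")


def conversational_eval(image_caption: str,
                        model_answer,  # callable: (question, history) -> str
                        persona: Persona,
                        num_turns: int = 5) -> list[Turn]:
    """Run a multi-turn conversation and return the transcript for
    downstream hallucination checking."""
    generator = QuestionGenerator()
    history: list[Turn] = []
    for _ in range(num_turns):
        question = generator.next_question(image_caption, persona, history)
        answer = model_answer(question, history)
        history.append(Turn(question=question, answer=answer))
    return history


if __name__ == "__main__":
    # Stub for the MLLM under evaluation; stands in for a real model call.
    def stub_mllm(question: str, history: list[Turn]) -> str:
        return f"[model answer to: {question}]"

    transcript = conversational_eval(
        image_caption="a dog catching a frisbee in a park",
        model_answer=stub_mllm,
        persona=Persona(name="Alex", style="detail-oriented reviewer"),
    )
    for turn in transcript:
        print(turn.question, "->", turn.answer)
```

In this sketch the conversation history is threaded through both the question generator and the model call, which is what distinguishes the conversational setting from single-turn captioning or question answering.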

This paper is under review.