Abstract
Large Vision-Language Models (LVLMs) have recently demonstrated remarkable progress, yet hallucination remains a critical barrier, particularly in chart understanding, which requires sophisticated perceptual and cognitive abilities as well as rigorous factual accuracy. While prior work has investigated hallucinations and chart comprehension independently, their intersection remains largely unexplored. To address this gap, we present ChartHal, a benchmark that features a fine-grained taxonomy of hallucination scenarios in chart understanding, along with a human-validated dataset of 1,062 samples. Our evaluation shows that state-of-the-art LVLMs suffer from severe hallucinations on ChartHal, including proprietary models such as GPT-5 and o4-mini, which achieve only 34.46% and 22.79% accuracy, respectively. Further analysis reveals that questions involving information absent from or contradictory to charts are especially likely to trigger hallucinations, underscoring the urgent need for more robust mitigation strategies.
- Introduces ChartHal, a benchmark targeting hallucinations in chart understanding.
- Provides a fine-grained taxonomy and a human-validated chart dataset aligned to it.
- Goes beyond accuracy with metrics for hallucination rate and severity.
- Finds persistent hallucinations in SOTA LVLMs, especially on complex charts; calls for targeted mitigation.
Dataset Preview
ChartHal covers four categories: Irrelevant, Inexist, Contradictory, and Normal. Each sample comes with a question and a reference answer, and we additionally provide a GPT-5 response for comparison. Please click each tab to explore the dataset; a hypothetical sample layout is sketched after the examples below.
Question: Based on the positions of the ap, sm, fi, and ni markers across the three λL curves, which group experienced the highest average monthly Twitter follower growth over the period represented in the chart?
Reference: The chart does not contain any information about Twitter follower counts or their growth rates. Therefore, no inference about average monthly Twitter follower growth can be made from this data.
GPT-5: fi. The fi markers move the farthest upward/right across the three λL curves, indicating the largest increase in monthly follower growth over the period. ✗ Incorrect
Question: What temperature value is associated with the star-shaped (sm) marker in the red λL = 0.20 series?
Reference: The chart does not show any temperature data, so this information cannot be determined from the figure.
GPT-5: I can't tell from this figure. The plot only shows WH vs WL with series by λL and symbol types; it doesn't include temperature labels or a colorbar. If you have the temperature mapping for each marker, please share it and I'll read off the value. ✓ Correct
Question: According to the chart, for λL = 0.20 the 'fi' marker (×) is shown at WH = 0.10 where WL appears to be about 0.05. What is the low-quality work value WL at that marker?
Reference: The premise is incorrect: the 'fi' marker for λL = 0.20 is actually plotted at WH = 0.025 (answers in the range [0.015, 0.035] are ACCEPTABLE), not 0.10, and its true WL value is approximately 0.048 (answers in the range [0.045, 0.050] are ACCEPTABLE).
GPT-5: Approximately 0.05 (about 0.048). ✗ Incorrect
Question: For λL = 0.20 (the red dotted series), what is the WL value of the 'fi' data point (× marker)?
Reference: Approximately 0.048 (answers in the range [0.045, 0.050] are ACCEPTABLE).
GPT-5: Approximately 0.05. ✓ Correct
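To make the structure of the examples above concrete, here is a minimal sketch of what a single ChartHal record might look like. All field names, values, and the file path are illustrative assumptions and do not describe the released data format.

```python
# Minimal sketch of a single ChartHal record (hypothetical layout).
# Every field name and value below is an illustrative assumption,
# not the released data schema.
sample = {
    "image": "charts/example_0001.png",   # path to the rendered chart image (made up)
    "question_type": "descriptive",       # descriptive / reasoning / open-ended
    "relation": "inexist",                # irrelevant / inexist / contradictory / normal
    "question": (
        "What temperature value is associated with the star-shaped (sm) "
        "marker in the red λL = 0.20 series?"
    ),
    "reference": (
        "The chart does not show any temperature data, so this information "
        "cannot be determined from the figure."
    ),
}

# For non-"normal" relations, a faithful model should recognize that the
# requested information is absent from (or contradicts) the chart and decline
# to fabricate an answer.
```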
Results
The following table reports performance on ChartHal, broken down by question type (Desc., Reason, Open) and chart-question relation (Irrel., Inexist, Contra., Normal). Scores are scaled to [0, 100]. A toy sketch of how per-category scores can be aggregated follows the table.
| Model | Desc. | Reason | Open | Irrel. | Inexist | Contra. | Normal | Overall |
|---|---|---|---|---|---|---|---|---|
| *Proprietary Multimodal Large Language Models* |  |  |  |  |  |  |  |  |
| Gemini-2.5-Pro | 60.31 | 50.31 | 36.69 | 36.80 | 52.62 | 49.52 | 58.58 | 49.34 |
| GPT-5 (high) | 43.34 | 36.65 | 22.97 | 20.82 | 29.07 | 19.52 | 70.71 | 34.46 |
| GPT-5 | 42.82 | 40.06 | 20.17 | 23.05 | 30.81 | 16.67 | 67.78 | 34.37 |
| GPT-4o-2024-11-20 | 40.47 | 36.02 | 20.73 | 27.88 | 36.92 | 6.67 | 53.97 | 32.49 |
| GPT-5-mini | 34.20 | 24.53 | 21.57 | 8.18 | 15.99 | 9.05 | 79.92 | 27.02 |
| GPT-5-mini (high) | 33.42 | 25.78 | 20.45 | 8.55 | 15.99 | 11.90 | 75.73 | 26.74 |
| GPT-5-nano (high) | 33.68 | 20.19 | 19.89 | 10.41 | 17.73 | 5.24 | 69.04 | 24.95 |
| GPT-5-nano | 32.38 | 16.15 | 19.33 | 7.81 | 15.99 | 5.24 | 66.11 | 23.07 |
| o4-mini (high) | 34.46 | 23.60 | 9.52 | 10.78 | 14.53 | 3.33 | 65.27 | 22.79 |
| o4-mini | 32.90 | 22.67 | 10.92 | 11.52 | 12.21 | 1.90 | 67.36 | 22.41 |
| *Open-Source Multimodal Large Language Models* |  |  |  |  |  |  |  |  |
| Llama-3.2-11B-Vision-Instruct | 37.60 | 20.19 | 15.69 | 29.37 | 28.49 | 0.00 | 36.82 | 24.95 |
| InternVL2.5-4B | 19.58 | 9.63 | 8.12 | 4.83 | 11.34 | 0.00 | 34.73 | 12.71 |
| InternVL2.5-8B | 17.23 | 8.39 | 10.64 | 3.35 | 6.40 | 0.00 | 41.84 | 12.34 |
| InternVL2.5-38B | 28.72 | 18.32 | 10.36 | 10.04 | 21.80 | 0.00 | 43.51 | 19.40 |
| InternVL2.5-78B | 31.33 | 17.08 | 14.85 | 13.75 | 26.45 | 0.00 | 41.84 | 21.47 |
| Qwen2.5-VL-3B-Instruct | 13.05 | 4.04 | 7.84 | 0.74 | 0.87 | 0.00 | 35.98 | 8.57 |
| Qwen2.5-VL-7B-Instruct | 38.38 | 25.47 | 12.32 | 25.65 | 25.29 | 1.90 | 47.28 | 25.71 |
| Qwen2.5-VL-32B-Instruct | 36.81 | 25.47 | 8.68 | 21.56 | 25.29 | 1.43 | 44.35 | 23.92 |
| Qwen2.5-VL-72B-Instruct | 58.49 | 59.63 | 44.82 | 65.06 | 69.77 | 19.05 | 50.63 | 54.24 |
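As a rough illustration of how per-relation scores like those above could be assembled, here is a minimal sketch that averages already-judged samples per chart-question relation. The judging step itself is omitted, and the field names `relation` and `correct` are assumptions rather than the benchmark's actual evaluation code.

```python
from collections import defaultdict

def aggregate_scores(judged_samples):
    """Average correctness per chart-question relation and overall, on a 0-100 scale.

    `judged_samples` is assumed to be a list of dicts carrying a "relation"
    label (irrelevant / inexist / contradictory / normal) and a boolean
    "correct" flag produced by an upstream judging step; both field names
    are illustrative assumptions.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for s in judged_samples:
        totals[s["relation"]] += 1
        correct[s["relation"]] += int(s["correct"])
    per_relation = {r: 100.0 * correct[r] / totals[r] for r in totals}
    overall = 100.0 * sum(correct.values()) / sum(totals.values())
    return per_relation, overall

# Toy usage with made-up judgements:
demo = [
    {"relation": "normal", "correct": True},
    {"relation": "contradictory", "correct": False},
    {"relation": "inexist", "correct": True},
]
print(aggregate_scores(demo))  # per-relation accuracies plus an overall score of about 66.7
```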
BibTeX
Please cite our work if you find it helpful.
@article{charthal,
title={{ChartHal}: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding},
author={Wang, Xingqi and Cui, Yiming and Yao, Xin and Wang, Shijin and Hu, Guoping and Qin, Xiaoyu},
year={2025},
eprint={2509.17481},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Team
ChartHal was developed by a team of researchers from Tsinghua University and iFLYTEK Research.