Abstract
Large Vision-Language Models (LVLMs) have recently demonstrated remarkable progress, yet hallucination remains a critical barrier, particularly in chart understanding, which requires sophisticated perceptual and cognitive abilities as well as rigorous factual accuracy. While prior work has investigated hallucinations and chart comprehension independently, their intersection remains largely unexplored. To address this gap, we present ChartHal, a benchmark that features a fine-grained taxonomy of hallucination scenarios in chart understanding, along with a human-validated dataset of 1,062 samples. Our evaluation shows that state-of-the-art LVLMs suffer from severe hallucinations on ChartHal, including proprietary models such as GPT-5 and o4-mini, which achieve only 34.46% and 22.79% accuracy, respectively. Further analysis reveals that questions involving information absent from or contradictory to charts are especially likely to trigger hallucinations, underscoring the urgent need for more robust mitigation strategies.
- Introduces ChartHal, a benchmark targeting hallucinations in chart understanding.
- Provides a fine-grained taxonomy and a human-validated chart dataset aligned to it.
- Goes beyond accuracy with metrics for hallucination rate and severity.
- Finds persistent hallucinations in SOTA LVLMs, especially on complex charts; calls for targeted mitigation.
Dataset Preview
ChartHal covers four categories: Irrelevant, Inexist, Contradictory, and Normal. Each sample comes with a question and a reference answer, and we additionally provide a GPT-5 response for comparison. Please click each tab to explore the dataset; a hypothetical sample layout is sketched after the examples below.
Question: Based on the positions of the ap, sm, fi, and ni markers across the three λL curves, which group experienced the highest average monthly Twitter follower growth over the period represented in the chart?
Reference: The chart does not contain any information about Twitter follower counts or their growth rates. Therefore, no inference about average monthly Twitter follower growth can be made from this data.
GPT-5: fi. The fi markers move the farthest upward/right across the three λL curves, indicating the largest increase in monthly follower growth over the period. ✗ Incorrect
Question: What temperature value is associated with the star-shaped (sm) marker in the red λL = 0.20 series?
Reference: The chart does not show any temperature data, so this information cannot be determined from the figure.
GPT-5: I can't tell from this figure. The plot only shows WH vs WL with series by λL and symbol types; it doesn't include temperature labels or a colorbar. If you have the temperature mapping for each marker, please share it and I'll read off the value. ✓ Correct
Question: According to the chart, for λL = 0.20 the 'fi' marker (×) is shown at WH = 0.10 where WL appears to be about 0.05. What is the low-quality work value WL at that marker?
Reference: The premise is incorrect: the 'fi' marker for λL = 0.20 is actually plotted at WH = 0.025 (answers in the range [0.015, 0.035] are ACCEPTABLE), not 0.10, and its true WL value is approximately 0.048 (answers in the range [0.045, 0.050] are ACCEPTABLE).
GPT-5: Approximately 0.05 (about 0.048). ✗ Incorrect
Question: For λL = 0.20 (the red dotted series), what is the WL value of the 'fi' data point (× marker)?
Reference: Approximately 0.048 (answers in the range [0.045, 0.050] are ACCEPTABLE).
GPT-5: Approximately 0.05. ✓ Correct
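To make the structure of the examples above concrete, here is a minimal sketch of what a single ChartHal record might look like. All field names, values, and the file path are illustrative assumptions and do not describe the released data format.

```python
# Minimal sketch of a single ChartHal record (hypothetical layout).
# Every field name and value below is an illustrative assumption,
# not the released data schema.
sample = {
    "image": "charts/example_0001.png",   # path to the rendered chart image (made up)
    "question_type": "descriptive",       # descriptive / reasoning / open-ended
    "relation": "inexist",                # irrelevant / inexist / contradictory / normal
    "question": (
        "What temperature value is associated with the star-shaped (sm) "
        "marker in the red λL = 0.20 series?"
    ),
    "reference": (
        "The chart does not show any temperature data, so this information "
        "cannot be determined from the figure."
    ),
}

# For non-"normal" relations, a faithful model should recognize that the
# requested information is absent from (or contradicts) the chart and decline
# to fabricate an answer.
```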
Results
The following table reports performance on ChartHal, broken down by question type (Desc., Reason, Open) and chart-question relation (Irrel., Inexist, Contra., Normal). Scores are scaled to [0, 100]. A toy sketch of how per-category scores can be aggregated follows the table.
| Model | Desc. | Reason | Open | Irrel. | Inexist | Contra. | Normal | Overall |
|---|---|---|---|---|---|---|---|---|
| *Proprietary Multimodal Large Language Models* |  |  |  |  |  |  |  |  |
| Gemini-2.5-Pro | 60.31 | 50.31 | 36.69 | 36.80 | 52.62 | 49.52 | 58.58 | 49.34 |
| GPT-5 (high) | 43.34 | 36.65 | 22.97 | 20.82 | 29.07 | 19.52 | 70.71 | 34.46 |
| GPT-5 | 42.82 | 40.06 | 20.17 | 23.05 | 30.81 | 16.67 | 67.78 | 34.37 |
| GPT-4o-2024-11-20 | 40.47 | 36.02 | 20.73 | 27.88 | 36.92 | 6.67 | 53.97 | 32.49 |
| GPT-5-mini | 34.20 | 24.53 | 21.57 | 8.18 | 15.99 | 9.05 | 79.92 | 27.02 |
| GPT-5-mini (high) | 33.42 | 25.78 | 20.45 | 8.55 | 15.99 | 11.90 | 75.73 | 26.74 |
| GPT-5-nano (high) | 33.68 | 20.19 | 19.89 | 10.41 | 17.73 | 5.24 | 69.04 | 24.95 |
| GPT-5-nano | 32.38 | 16.15 | 19.33 | 7.81 | 15.99 | 5.24 | 66.11 | 23.07 |
| o4-mini (high) | 34.46 | 23.60 | 9.52 | 10.78 | 14.53 | 3.33 | 65.27 | 22.79 |
| o4-mini | 32.90 | 22.67 | 10.92 | 11.52 | 12.21 | 1.90 | 67.36 | 22.41 |
| *Open-Source Multimodal Large Language Models* |  |  |  |  |  |  |  |  |
| Llama-3.2-11B-Vision-Instruct | 37.60 | 20.19 | 15.69 | 29.37 | 28.49 | 0.00 | 36.82 | 24.95 |
| InternVL2.5-4B | 19.58 | 9.63 | 8.12 | 4.83 | 11.34 | 0.00 | 34.73 | 12.71 |
| InternVL2.5-8B | 17.23 | 8.39 | 10.64 | 3.35 | 6.40 | 0.00 | 41.84 | 12.34 |
| InternVL2.5-38B | 28.72 | 18.32 | 10.36 | 10.04 | 21.80 | 0.00 | 43.51 | 19.40 |
| InternVL2.5-78B | 31.33 | 17.08 | 14.85 | 13.75 | 26.45 | 0.00 | 41.84 | 21.47 |
| Qwen2.5-VL-3B-Instruct | 13.05 | 4.04 | 7.84 | 0.74 | 0.87 | 0.00 | 35.98 | 8.57 |
| Qwen2.5-VL-7B-Instruct | 38.38 | 25.47 | 12.32 | 25.65 | 25.29 | 1.90 | 47.28 | 25.71 |
| Qwen2.5-VL-32B-Instruct | 36.81 | 25.47 | 8.68 | 21.56 | 25.29 | 1.43 | 44.35 | 23.92 |
| Qwen2.5-VL-72B-Instruct | 58.49 | 59.63 | 44.82 | 65.06 | 69.77 | 19.05 | 50.63 | 54.24 |
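As a rough illustration of how per-relation scores like those above could be assembled, here is a minimal sketch that averages already-judged samples per chart-question relation. The judging step itself is omitted, and the field names `relation` and `correct` are assumptions rather than the benchmark's actual evaluation code.

```python
from collections import defaultdict

def aggregate_scores(judged_samples):
    """Average correctness per chart-question relation and overall, on a 0-100 scale.

    `judged_samples` is assumed to be a list of dicts carrying a "relation"
    label (irrelevant / inexist / contradictory / normal) and a boolean
    "correct" flag produced by an upstream judging step; both field names
    are illustrative assumptions.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for s in judged_samples:
        totals[s["relation"]] += 1
        correct[s["relation"]] += int(s["correct"])
    per_relation = {r: 100.0 * correct[r] / totals[r] for r in totals}
    overall = 100.0 * sum(correct.values()) / sum(totals.values())
    return per_relation, overall

# Toy usage with made-up judgements:
demo = [
    {"relation": "normal", "correct": True},
    {"relation": "contradictory", "correct": False},
    {"relation": "inexist", "correct": True},
]
print(aggregate_scores(demo))  # per-relation accuracies plus an overall score of about 66.7
```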
BibTeX
Please cite our work if you find it helpful.
@article{charthal,
title={{ChartHal}: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding},
author={Wang, Xingqi and Cui, Yiming and Yao, Xin and Wang, Shijin and Hu, Guoping and Qin, Xiaoyu},
year={2025},
eprint={2509.17481},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
Team
ChartHal was developed by a team of researchers from Tsinghua University and iFLYTEK Research.