ArXiv Pre-Print · 2025

🌀 ChartHal

A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding

  • Xingqi Wang1∗
  • Yiming Cui2✉
  • Xin Yao2
  • Shijin Wang2
  • Guoping Hu2
  • Xiaoyu Qin1✉
1 Dept. of Computer Science and Technology, Tsinghua University 2 State Key Laboratory of Cognitive Intelligence, iFLYTEK
∗ Work done during internship at iFLYTEK · ✉ Corresponding authors

Abstract

Large Vision-Language Models (LVLMs) have recently demonstrated remarkable progress, yet hallucination remains a critical barrier, particularly in chart understanding, which requires sophisticated perceptual and cognitive abilities as well as rigorous factual accuracy. While prior work has investigated hallucinations and chart comprehension independently, their intersection remains largely unexplored. To address this gap, we present ChartHal, a benchmark that features a fine-grained taxonomy of hallucination scenarios in chart understanding, along with a human-validated dataset of 1,062 samples. Our evaluation shows that state-of-the-art LVLMs suffer from severe hallucinations on ChartHal, including proprietary models such as GPT-5 and o4-mini, which achieve only 34.46% and 22.79% accuracy, respectively. Further analysis reveals that questions involving information absent from or contradictory to charts are especially likely to trigger hallucinations, underscoring the urgent need for more robust mitigation strategies.

  • ✓Introduces ChartHal, a benchmark targeting hallucinations in chart understanding.
  • ✓Provides a fine-grained taxonomy and a human-validated chart dataset aligned to it.
  • ✓Goes beyond accuracy with metrics for hallucination rate and severity.
  • ✓Finds persistent hallucinations in SOTA LVLMs, especially on questions whose premises are absent from or contradict the chart; calls for targeted mitigation.

Dataset Preview

ChartHal covers four chart–question relation categories: Irrelevant, Inexist, Contradictory, and Normal. Along with the question and reference answer, we also provide the GPT-5 response for each sample. Click each tab to explore the dataset.

Question: Based on the positions of the ap, sm, fi, and ni markers across the three λL curves, which group experienced the highest average monthly Twitter follower growth over the period represented in the chart?

Reference: The chart does not contain any information about Twitter follower counts or their growth rates. Therefore, no inference about average monthly Twitter follower growth can be made from this data.

GPT-5: fi. The fi markers move the farthest upward/right across the three λL curves, indicating the largest increase in monthly follower growth over the period. ❌ Incorrect
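
For readers who want to work with released samples programmatically, here is a minimal sketch of how a ChartHal record could be represented and grouped by relation category. The field names (`image`, `question`, `reference`, `relation`, `question_type`) and the file name `charthal.json` are illustrative assumptions, not the dataset's confirmed schema.

```python
from collections import defaultdict

# Hypothetical ChartHal record; field names are illustrative assumptions,
# not the official schema.
sample = {
    "image": "charts/0001.png",
    "question": "Based on the positions of the ap, sm, fi, and ni markers ...",
    "reference": "The chart does not contain any information about Twitter follower counts ...",
    "relation": "Irrelevant",      # Irrelevant / Inexist / Contradictory / Normal
    "question_type": "Reasoning",  # assumed label; Description / Reasoning / Open-ended
}

def group_by_relation(records):
    """Bucket samples by chart-question relation for per-category analysis."""
    buckets = defaultdict(list)
    for record in records:
        buckets[record["relation"]].append(record)
    return buckets

if __name__ == "__main__":
    # In practice, records would be loaded from the released file,
    # e.g. json.load(open("charthal.json")) -- a hypothetical path.
    records = [sample]
    for relation, items in group_by_relation(records).items():
        print(relation, len(items))
```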

Results

The table below reports performance on ChartHal, broken down by question type (Desc. = Description, Reason. = Reasoning, Open = Open-ended) and by chart–question relation (Irrel. = Irrelevant, Inexist, Contra. = Contradictory, Normal). Scores are scaled to [0, 100]; higher is better.

| Model | Desc. | Reason. | Open | Irrel. | Inexist | Contra. | Normal | Overall |
|-------|------:|--------:|-----:|-------:|--------:|--------:|-------:|--------:|
| **Proprietary Multimodal Large Language Models** | | | | | | | | |
| Gemini-2.5-Pro | 60.31 | 50.31 | 36.69 | 36.80 | 52.62 | 49.52 | 58.58 | 49.34 |
| GPT-5 (high) | 43.34 | 36.65 | 22.97 | 20.82 | 29.07 | 19.52 | 70.71 | 34.46 |
| GPT-5 | 42.82 | 40.06 | 20.17 | 23.05 | 30.81 | 16.67 | 67.78 | 34.37 |
| GPT-4o-2024-11-20 | 40.47 | 36.02 | 20.73 | 27.88 | 36.92 | 6.67 | 53.97 | 32.49 |
| GPT-5-mini | 34.20 | 24.53 | 21.57 | 8.18 | 15.99 | 9.05 | 79.92 | 27.02 |
| GPT-5-mini (high) | 33.42 | 25.78 | 20.45 | 8.55 | 15.99 | 11.90 | 75.73 | 26.74 |
| GPT-5-nano (high) | 33.68 | 20.19 | 19.89 | 10.41 | 17.73 | 5.24 | 69.04 | 24.95 |
| GPT-5-nano | 32.38 | 16.15 | 19.33 | 7.81 | 15.99 | 5.24 | 66.11 | 23.07 |
| o4-mini (high) | 34.46 | 23.60 | 9.52 | 10.78 | 14.53 | 3.33 | 65.27 | 22.79 |
| o4-mini | 32.90 | 22.67 | 10.92 | 11.52 | 12.21 | 1.90 | 67.36 | 22.41 |
| **Open-Source Multimodal Large Language Models** | | | | | | | | |
| Llama-3.2-11B-Vision-Instruct | 37.60 | 20.19 | 15.69 | 29.37 | 28.49 | 0.00 | 36.82 | 24.95 |
| InternVL2.5-4B | 19.58 | 9.63 | 8.12 | 4.83 | 11.34 | 0.00 | 34.73 | 12.71 |
| InternVL2.5-8B | 17.23 | 8.39 | 10.64 | 3.35 | 6.40 | 0.00 | 41.84 | 12.34 |
| InternVL2.5-38B | 28.72 | 18.32 | 10.36 | 10.04 | 21.80 | 0.00 | 43.51 | 19.40 |
| InternVL2.5-78B | 31.33 | 17.08 | 14.85 | 13.75 | 26.45 | 0.00 | 41.84 | 21.47 |
| Qwen2.5-VL-3B-Instruct | 13.05 | 4.04 | 7.84 | 0.74 | 0.87 | 0.00 | 35.98 | 8.57 |
| Qwen2.5-VL-7B-Instruct | 38.38 | 25.47 | 12.32 | 25.65 | 25.29 | 1.90 | 47.28 | 25.71 |
| Qwen2.5-VL-32B-Instruct | 36.81 | 25.47 | 8.68 | 21.56 | 25.29 | 1.43 | 44.35 | 23.92 |
| Qwen2.5-VL-72B-Instruct | 58.49 | 59.63 | 44.82 | 65.06 | 69.77 | 19.05 | 50.63 | 54.24 |
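
As a rough illustration of how the per-category scores above could be computed, the sketch below tallies judged correctness per chart–question relation and scales accuracy to [0, 100]. How correctness is judged (string matching vs. an LLM judge) is not detailed on this page, so the boolean judgments here are placeholder assumptions.

```python
from collections import Counter

def category_scores(results):
    """Accuracy per chart-question relation, scaled to [0, 100].

    `results` is a list of (relation, is_correct) pairs; the judging step
    that produces is_correct is an assumption, not specified on this page.
    """
    total, correct = Counter(), Counter()
    for relation, is_correct in results:
        total[relation] += 1
        correct[relation] += int(is_correct)
    return {rel: 100.0 * correct[rel] / total[rel] for rel in total}

# Toy usage with placeholder judgments:
judged = [("Normal", True), ("Normal", False), ("Contradictory", False)]
print(category_scores(judged))  # {'Normal': 50.0, 'Contradictory': 0.0}
```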

BibTeX

Please cite our work if you find it helpful.

@article{charthal,
      title={{ChartHal}: A Fine-grained Framework Evaluating Hallucination of Large Vision Language Models in Chart Understanding}, 
      author={Wang, Xingqi and Cui, Yiming and Yao, Xin and Wang, Shijin and Hu, Guoping and Qin, Xiaoyu},
      year={2025},
      eprint={2509.17481},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

Team

ChartHal is the work of a team of researchers from Tsinghua University and iFLYTEK Research.

  • Xingqi Wang · Tsinghua University
  • Yiming Cui · iFLYTEK Research
  • Xin Yao · iFLYTEK Research
  • Shijin Wang · iFLYTEK Research
  • Guoping Hu · iFLYTEK Research
  • Xiaoyu Qin · Tsinghua University