ExpMRC is a benchmark for evaluating the explainability of machine reading comprehension (MRC) systems. ExpMRC contains four subsets built on popular MRC datasets with additionally annotated evidence: SQuAD, CMRC 2018, RACE+ (similar to RACE), and C3, covering span-extraction and multiple-choice MRC tasks in both English and Chinese.
Paper and BibTeX: [Cui et al., 2022]. Download a copy of the dataset (distributed under the CC BY-SA 4.0 license).
To evaluate your models, we have also made the official evaluation script available, along with sample predictions for each subset.
To run the evaluation: `python eval_expmrc.py <path_to_dev> <path_to_predictions>`
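For illustration, here is a minimal sketch of writing a predictions file before invoking the script. The JSON schema (a question ID mapped to an answer string and an evidence string) and the file names are assumptions; check the sample predictions shipped with the evaluation script for the authoritative format.

```python
import json

# Hypothetical predictions keyed by question ID; each entry carries the
# predicted answer text and the supporting evidence sentence.
# NOTE: this schema is an assumption -- consult the official sample
# predictions for the exact format expected by eval_expmrc.py.
predictions = {
    "dev_0001": {
        "answer": "Normandy",
        "evidence": "The Normans were the people who gave their name to "
                    "Normandy, a region in France.",
    },
}

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)

# Then evaluate against the development set, e.g. (placeholder file name):
#   python eval_expmrc.py expmrc-squad-dev.json predictions.json
```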
You may also be interested in a quick baseline system based on pre-trained language models, such as BERT. The code is distributed under the Apache-2.0 license.
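As a rough flavor of such a baseline (not the official code), the sketch below predicts an answer span with an off-the-shelf QA model and takes the sentence containing that span as the evidence; the model name and the naive period-based sentence segmentation are illustrative assumptions.

```python
from transformers import pipeline

# A stand-in for a span-extraction baseline: predict the answer with an
# off-the-shelf QA model, then return the sentence containing the answer
# span as the evidence. This is NOT the official ExpMRC baseline.
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

def predict(question: str, context: str) -> dict:
    result = qa(question=question, context=context)
    # Naive sentence boundaries: nearest periods around the answer span.
    start = context.rfind(".", 0, result["start"]) + 1
    end = context.find(".", result["end"])
    evidence = context[start:end + 1] if end != -1 else context[start:]
    return {"answer": result["answer"], "evidence": evidence.strip()}
```

Taking the answer-bearing sentence as evidence is in the same spirit as the unsupervised heuristics on the leaderboard below, though the exact official recipes differ.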
To preserve the integrity of test results, we do not release the test sets to the public. Instead, we require you to upload your model to CodaLab so that we can run it on the test sets for you. You can follow the instructions on CodaLab (the process is similar to the SQuAD and CMRC 2018 submissions). See the Submission Tutorial.
Please read the following before submitting. We also strongly suggest that participants follow our baseline settings to ensure a fair comparison for academic purposes.
The leaderboard is updated whenever new submissions are included. The ranking is based on "Overall F1", which takes both answer and evidence scores into account (it is not simply the product of the two corpus-level scores). Details of the evaluation process can be found here.
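To make the metric concrete, here is a minimal sketch of one plausible reading of the description above: each example's answer F1 and evidence F1 are multiplied at the instance level, and these products are averaged, which is why the overall score differs from the product of the two corpus-level averages. The token-overlap F1 and all names below are illustrative assumptions, not the official implementation in eval_expmrc.py.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a predicted and a reference string."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

def overall_f1(examples) -> float:
    """Average the per-example product of answer F1 and evidence F1.

    An example only scores well if BOTH its answer and its evidence are
    good, which makes Overall F1 stricter than either score alone and
    different from multiplying the two corpus-level averages.
    """
    scores = [
        token_f1(ex["pred_answer"], ex["gold_answer"])
        * token_f1(ex["pred_evidence"], ex["gold_evidence"])
        for ex in examples
    ]
    return 100.0 * sum(scores) / len(scores)
```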
Ask us questions at our GitHub repository or at expmrc [at] 126 [dot] com.
Explainability is a universal demand across machine reading comprehension tasks. Most MRC systems now achieve near-human or super-human performance on these datasets, but will your system also surpass humans at giving correct explanations?
**SQuAD (EN)**

| Rank | Model | Answer F1 | Evidence F1 | Overall F1 |
|---|---|---|---|---|
| | Human Performance<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 91.3 | 92.9 | 84.7 |
| 🥇<br>June 19, 2023 | Anonymized Model (single model)<br>Anonymized Organization | 92.968 | 90.195 | 84.448 |
| 🥈<br>Oct 29, 2022 | T5InterMRC (single model)<br>Guangxi Normal University | 90.237 | 90.859 | 84.079 |
| 🥉<br>June 6, 2023 | Anonymized Model (single model)<br>Anonymized Organization | 92.968 | 89.626 | 83.973 |
| 4<br>May 11, 2021 | BERT-large + PA Sent. (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 92.300 | 89.600 | 83.600 |
| 5<br>May 23, 2023 | WSQWevi-QA (single model)<br>Anonymized Organization | 91.074 | 90.195 | 83.489 |
| 6<br>May 11, 2021 | BERT-large + MSS (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 92.300 | 85.700 | 80.400 |
| 7<br>May 16, 2023 | WSQW-QA (single model)<br>Anonymized Organization | 91.074 | 86.882 | 80.356 |
| 8<br>May 11, 2021 | BERT-base + PA Sent. (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 87.100 | 89.100 | 79.600 |
| 9<br>May 11, 2021 | BERT-base + MSS (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 87.100 | 85.400 | 76.100 |
**CMRC 2018 (ZH)**

| Rank | Model | Answer F1 | Evidence F1 | Overall F1 |
|---|---|---|---|---|
| | Human Performance<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 97.9 | 94.6 | 92.6 |
| 🥇<br>Oct 30, 2022 | MT-MacBert+DA (ensemble)<br>GammaLab | 91.979 | 85.191 | 79.028 |
| 🥈<br>May 23, 2023 | WSQWevi-QA (single model)<br>Anonymized Organization | 92.223 | 80.220 | 74.839 |
| 🥉<br>May 16, 2023 | WSQW-QA (single model)<br>Anonymized Organization | 92.223 | 79.786 | 74.452 |
| 4<br>Oct 30, 2022 | MacBERT-large + Pseudo&EScore + DA (ensemble)<br>Shanxi University | 92.804 | 79.404 | 73.934 |
| 5<br>May 11, 2023 | Anonymized Model + DA (single model)<br>Shanxi University | 90.668 | 79.431 | 72.507 |
| 6<br>June 6, 2023 | Anonymized Model (single model)<br>Anonymized Organization | 92.223 | 77.411 | 72.102 |
| 7<br>May 11, 2021 | MacBERT-large + PA Sent. (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 88.600 | 70.600 | 63.300 |
| 8<br>May 11, 2021 | MacBERT-large + MSS w/ Ques. (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 88.600 | 71.000 | 63.200 |
| 9<br>May 11, 2021 | MacBERT-base + MSS w/ Ques. (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 84.400 | 69.800 | 59.900 |
| 10<br>May 11, 2021 | MacBERT-base + PA Sent. (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 84.400 | 69.100 | 59.800 |
**RACE+ (EN)**

| Rank | Model | Answer Acc | Evidence F1 | Overall F1 |
|---|---|---|---|---|
| | Human Performance<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 93.6 | 90.5 | 84.4 |
| 🥇<br>Jul 4, 2023 | SCL4E (single model)<br>YSU | 75.887 | 58.765 | 50.859 |
| 🥈<br>Sep 3, 2021 | BERT-base+EveMRC (single model)<br>Shanghai Jiao Tong University | 66.667 | 52.542 | 40.739 |
| 🥉<br>May 3, 2023 | STEXP+Macbert-base (single model)<br>YSU | 71.986 | 48.774 | 38.420 |
| 4<br>May 11, 2021 | BERT-large + MSS w/ Ques. (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 68.100 | 42.500 | 31.300 |
| 5<br>May 11, 2021 | BERT-large + Pseudo-data (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 70.400 | 41.300 | 30.800 |
| 6<br>May 11, 2021 | BERT-base + MSS w/ Ques. (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 59.800 | 41.800 | 27.300 |
| 7<br>May 11, 2021 | BERT-base + Pseudo-data (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 60.100 | 43.500 | 27.100 |
**C3 (ZH)**

| Rank | Model | Answer Acc | Evidence F1 | Overall F1 |
|---|---|---|---|---|
| | Human Performance<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 94.3 | 97.7 | 90.0 |
| 🥇<br>Oct 30, 2022 | Ernie3-xbase+OptionExpMRC+DA (ensemble)<br>Shanxi University | 81.000 | 70.902 | 61.196 |
| 🥈<br>Aug 30, 2023 | STEXP (single model)<br>YSU | 77.400 | 69.492 | 60.429 |
| 🥉<br>Oct 30, 2022 | U3Ebase+Macbert-large (single model)<br>BIT [He et al., 2022] | 75.400 | 69.857 | 57.784 |
| 4<br>Jul 18, 2023 | SCL4E (single model)<br>YSU | 74.600 | 68.712 | 57.138 |
| 5<br>May 3, 2023 | STEXP+Macbert-large (single model)<br>YSU | 75.200 | 68.266 | 56.079 |
| 6<br>Jun 23, 2022 | BERT-base+EveMRC (single model)<br>Shanghai Jiao Tong University | 67.600 | 65.307 | 51.886 |
| 7<br>Jun 28, 2023 | Anonymized Model (single model)<br>Anonymized Organization | 66.600 | 63.584 | 48.826 |
| 8<br>May 11, 2021 | MacBERT-large + Pseudo-data (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 74.400 | 59.900 | 47.300 |
| 9<br>Jun 13, 2022 | MacBert-base + ABS2Pseudo-data (single model)<br>BIT | 65.600 | 60.262 | 46.239 |
| 10<br>May 11, 2021 | MacBERT-large + MSS w/ Ques. (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 72.000 | 58.400 | 46.000 |
| 11<br>Jun 20, 2022 | MacBert-base + ABS2Pseudo-data (single model)<br>BIT | 64.200 | 58.885 | 43.890 |
| 12<br>May 11, 2021 | MacBERT-base + MSS w/ Ques. (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 66.800 | 57.400 | 42.300 |
| 13<br>May 11, 2021 | MacBERT-base + Pseudo-data (single model)<br>Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 69.000 | 57.500 | 40.600 |