ExpMRC

Explainability Evaluation for Machine Reading Comprehension

What is ExpMRC?

ExpMRC is a benchmark for evaluating the explainability of Machine Reading Comprehension (MRC). It contains four subsets of popular MRC datasets with additionally annotated evidence: SQuAD, CMRC 2018, RACE+ (similar to RACE), and C3, covering both span-extraction and multiple-choice MRC tasks in English and Chinese.

Paper / BibTeX: [Cui et al., 2022]

Getting Started

Download a copy of the dataset, which is distributed under the CC BY-SA 4.0 license.

To evaluate your models, we also provide the official evaluation script, along with sample predictions for each subset. To run the evaluation, use: python eval_expmrc.py <path_to_dev> <path_to_predictions>
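
The prediction file is a JSON file containing your model's answer and evidence for each question. The sketch below is a hypothetical illustration only; the key and field names are assumptions, and the sample prediction files shipped with the evaluation script define the exact schema.

```python
# Hypothetical sketch of assembling a prediction file. The key/field
# names here are assumptions for illustration; consult the sample
# prediction files bundled with eval_expmrc.py for the exact schema.
import json

predictions = {
    "QUESTION_ID": {  # a question ID from the dev set (assumed key)
        "answer": "predicted answer text",
        "evidence": "the passage sentence supporting the answer",
    },
}

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
```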

You may also be interested in a quick baseline system based on pre-trained language models such as BERT; an unofficial illustration follows below. The code is distributed under the Apache-2.0 license.
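
As an illustration only (this is not the official baseline), the sketch below pairs an off-the-shelf extractive QA model from Hugging Face Transformers with a simple evidence heuristic: take the passage sentence that contains the predicted answer. The model checkpoint and the naive sentence splitting are assumptions.

```python
# Unofficial sketch: answer extraction with an off-the-shelf QA model,
# plus a naive "sentence containing the answer" evidence heuristic.
# The checkpoint and the heuristic are illustrative assumptions,
# not the official ExpMRC baseline.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/bert-base-cased-squad2")

def predict(question: str, passage: str) -> dict:
    answer = qa(question=question, context=passage)["answer"]
    # Naive sentence split; a real system should use a proper tokenizer.
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    evidence = next((s for s in sentences if answer in s), sentences[0])
    return {"answer": answer, "evidence": evidence}

print(predict("Where is the Eiffel Tower?",
              "The Eiffel Tower is in Paris. It was completed in 1889."))
```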

Official Submission

To preserve the integrity of test results, we do not release the test sets to the public. Instead, we require you to upload your model to CodaLab so that we can run it on the test sets for you. You can follow the instructions on CodaLab (the process is similar to the SQuAD and CMRC 2018 submissions); see the Submission Tutorial.

Submission Policy (IMPORTANT!)

Please read the following before submitting. We also strongly suggest that participants follow our baseline settings to ensure a fair comparison for academic purposes.

  • You are free to use any open-source machine reading comprehension data (labeled or unlabeled) or automatically generated data to train your systems.
  • You are NOT allowed to train on newly human-annotated data that is not publicly available; doing so violates our submission policy.
  • We do not encourage using the development set of ExpMRC for training (though it is not prohibited). Such submissions will be marked with an asterisk (*).
About the Leaderboard

The leaderboard is updated whenever new submissions are accepted. The ranking is based on the "Overall F1", which jointly considers answer and evidence F1; note that it is not simply the product of the two corpus-level scores. A sketch of the computation is given below; see the paper and the evaluation script for details.
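
To make the note above concrete, here is a sketch assuming an instance-level combination of the two scores (the paper and eval_expmrc.py are authoritative), showing why the corpus Overall F1 differs from the product of the two corpus-level scores.

```python
# Sketch of an instance-level overall score: combine each question's
# answer and evidence scores first, then average. Assumed behavior;
# see the paper and eval_expmrc.py for the authoritative definition.
def overall_f1(answer_scores, evidence_scores):
    pairs = list(zip(answer_scores, evidence_scores))
    return 100.0 * sum(a * e for a, e in pairs) / len(pairs)

# Corpus-level averages here are 0.9 (answer) and 0.5 (evidence), whose
# product would be 45.0, but the instance-level combination gives 42.0.
print(overall_f1([1.0, 0.8], [0.2, 0.8]))  # -> 42.0
```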

Have Questions?

Ask us questions at our GitHub repository or at expmrc [at] 126 [dot] com.

Leaderboard

Explainability is a universal demand across machine reading comprehension tasks. Most MRC systems now reach near-human or super-human performance on answering these datasets, but will your system also surpass humans at giving correct explanations?

SQuAD (EN)

| Rank | Date | Model | Organization | Answer F1 | Evidence F1 | Overall F1 |
|------|------|-------|--------------|-----------|-------------|------------|
| - | - | Human Performance | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 91.3 | 92.9 | 84.7 |
| 🥇 | June 19, 2023 | Anonymized Model (single model) | Anonymized Organization | 92.968 | 90.195 | 84.448 |
| 🥈 | Oct 29, 2022 | T5InterMRC (single model) | Guangxi Normal University | 90.237 | 90.859 | 84.079 |
| 🥉 | June 6, 2023 | Anonymized Model (single model) | Anonymized Organization | 92.968 | 89.626 | 83.973 |
| 4 | May 11, 2021 | BERT-large + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 92.300 | 89.600 | 83.600 |
| 5 | May 23, 2023 | WSQWevi-QA (single model) | Anonymized Organization | 91.074 | 90.195 | 83.489 |
| 6 | May 11, 2021 | BERT-large + MSS (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 92.300 | 85.700 | 80.400 |
| 7 | May 16, 2023 | WSQW-QA (single model) | Anonymized Organization | 91.074 | 86.882 | 80.356 |
| 8 | May 11, 2021 | BERT-base + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 87.100 | 89.100 | 79.600 |
| 9 | May 11, 2021 | BERT-base + MSS (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 87.100 | 85.400 | 76.100 |
CMRC 2018 (ZH)

| Rank | Date | Model | Organization | Answer F1 | Evidence F1 | Overall F1 |
|------|------|-------|--------------|-----------|-------------|------------|
| - | - | Human Performance | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 97.9 | 94.6 | 92.6 |
| 🥇 | Oct 30, 2022 | MT-MacBert+DA (ensemble) | GammaLab | 91.979 | 85.191 | 79.028 |
| 🥈 | May 23, 2023 | WSQWevi-QA (single model) | Anonymized Organization | 92.223 | 80.220 | 74.839 |
| 🥉 | May 16, 2023 | WSQW-QA (single model) | Anonymized Organization | 92.223 | 79.786 | 74.452 |
| 4 | Oct 30, 2022 | MacBERT-large + Pseudo&EScore + DA (ensemble) | Shanxi University | 92.804 | 79.404 | 73.934 |
| 5 | May 11, 2023 | Anonymized Model + DA (single model) | Shanxi University | 90.668 | 79.431 | 72.507 |
| 6 | June 6, 2023 | Anonymized Model (single model) | Anonymized Organization | 92.223 | 77.411 | 72.102 |
| 7 | May 11, 2021 | MacBERT-large + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 88.600 | 70.600 | 63.300 |
| 8 | May 11, 2021 | MacBERT-large + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 88.600 | 71.000 | 63.200 |
| 9 | May 11, 2021 | MacBERT-base + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 84.400 | 69.800 | 59.900 |
| 10 | May 11, 2021 | MacBERT-base + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 84.400 | 69.100 | 59.800 |
RACE+ (EN)

| Rank | Date | Model | Organization | Answer Acc | Evidence F1 | Overall F1 |
|------|------|-------|--------------|------------|-------------|------------|
| - | - | Human Performance | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 93.6 | 90.5 | 84.4 |
| 🥇 | Jul 4, 2023 | SCL4E (single model) | YSU | 75.887 | 58.765 | 50.859 |
| 🥈 | Sep 3, 2021 | BERT-base+EveMRC (single model) | Shanghai Jiao Tong University | 66.667 | 52.542 | 40.739 |
| 🥉 | May 3, 2023 | STEXP+Macbert-base (single model) | YSU | 71.986 | 48.774 | 38.420 |
| 4 | May 11, 2021 | BERT-large + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 68.100 | 42.500 | 31.300 |
| 5 | May 11, 2021 | BERT-large + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 70.400 | 41.300 | 30.800 |
| 6 | May 11, 2021 | BERT-base + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 59.800 | 41.800 | 27.300 |
| 7 | May 11, 2021 | BERT-base + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 60.100 | 43.500 | 27.100 |
C3 (ZH)

| Rank | Date | Model | Organization | Answer Acc | Evidence F1 | Overall F1 |
|------|------|-------|--------------|------------|-------------|------------|
| - | - | Human Performance | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 94.3 | 97.7 | 90.0 |
| 🥇 | Oct 30, 2022 | Ernie3-xbase+OptionExpMRC+DA (ensemble) | Shanxi University | 81.000 | 70.902 | 61.196 |
| 🥈 | Aug 30, 2023 | STEXP (single model) | YSU | 77.400 | 69.492 | 60.429 |
| 🥉 | Oct 30, 2022 | U3Ebase+Macbert-large (single model) | BIT [He et al., 2022] | 75.400 | 69.857 | 57.784 |
| 4 | Jul 18, 2023 | SCL4E (single model) | YSU | 74.600 | 68.712 | 57.138 |
| 5 | May 3, 2023 | STEXP+Macbert-large (single model) | YSU | 75.200 | 68.266 | 56.079 |
| 6 | Jun 23, 2022 | BERT-base+EveMRC (single model) | Shanghai Jiao Tong University | 67.600 | 65.307 | 51.886 |
| 7 | Jun 28, 2023 | Anonymized Model (single model) | Anonymized Organization | 66.600 | 63.584 | 48.826 |
| 8 | May 11, 2021 | MacBERT-large + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 74.400 | 59.900 | 47.300 |
| 9 | Jun 13, 2022 | MacBert-base + ABS2Pseudo-data (single model) | BIT | 65.600 | 60.262 | 46.239 |
| 10 | May 11, 2021 | MacBERT-large + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 72.000 | 58.400 | 46.000 |
| 11 | Jun 20, 2022 | MacBert-base + ABS2Pseudo-data (single model) | BIT | 64.200 | 58.885 | 43.890 |
| 12 | May 11, 2021 | MacBERT-base + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 66.800 | 57.400 | 42.300 |
| 13 | May 11, 2021 | MacBERT-base + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 69.000 | 57.500 | 40.600 |