ExpMRC

Explainability Evaluation for Machine Reading Comprehension

What is ExpMRC?

ExpMRC is a benchmark for evaluating the explainability of Machine Reading Comprehension (MRC). It contains four subsets of popular MRC datasets with additionally annotated evidence: SQuAD, CMRC 2018, RACE+ (similar to RACE), and C3, covering both span-extraction and multiple-choice MRC tasks in English and Chinese.

Paper / BibTeX: [Cui et al., 2022]

Getting Started

Download a copy of the dataset, which is distributed under the CC BY-SA 4.0 license.

To evaluate your models, we also provide the official evaluation script, along with sample predictions for each subset. To run the evaluation, use: python eval_expmrc.py <path_to_dev> <path_to_predictions>
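
The prediction file is a JSON file containing your model's answer and evidence for each question. The sketch below is a hypothetical illustration only; the key and field names are assumptions, and the sample prediction files shipped with the evaluation script define the exact schema.

```python
# Hypothetical sketch of assembling a prediction file. The key/field
# names here are assumptions for illustration; consult the sample
# prediction files bundled with eval_expmrc.py for the exact schema.
import json

predictions = {
    "QUESTION_ID": {  # a question ID from the dev set (assumed key)
        "answer": "predicted answer text",
        "evidence": "the passage sentence supporting the answer",
    },
}

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False, indent=2)
```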

You may also be interested in a quick baseline system based on pre-trained language models such as BERT; an unofficial illustration follows below. The code is distributed under the Apache-2.0 license.
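
As an illustration only (this is not the official baseline), the sketch below pairs an off-the-shelf extractive QA model from Hugging Face Transformers with a simple evidence heuristic: take the passage sentence that contains the predicted answer. The model checkpoint and the naive sentence splitting are assumptions.

```python
# Unofficial sketch: answer extraction with an off-the-shelf QA model,
# plus a naive "sentence containing the answer" evidence heuristic.
# The checkpoint and the heuristic are illustrative assumptions,
# not the official ExpMRC baseline.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/bert-base-cased-squad2")

def predict(question: str, passage: str) -> dict:
    answer = qa(question=question, context=passage)["answer"]
    # Naive sentence split; a real system should use a proper tokenizer.
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    evidence = next((s for s in sentences if answer in s), sentences[0])
    return {"answer": answer, "evidence": evidence}

print(predict("Where is the Eiffel Tower?",
              "The Eiffel Tower is in Paris. It was completed in 1889."))
```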

Official Submission

To preserve the integrity of test results, we do not release the test sets to the public. Instead, we require you to upload your model to CodaLab so that we can run it on the test sets for you. You can follow the instructions on CodaLab (the process is similar to the SQuAD and CMRC 2018 submissions); see the Submission Tutorial.

Submission Policy (IMPORTANT!)

Please read the following before submitting. We also strongly suggest that participants follow our baseline settings to ensure a fair comparison for academic purposes.

  • You are free to use any open-source machine reading comprehension data (labeled or unlabeled) or automatically generated data to train your systems.
  • You are NOT allowed to train on newly human-annotated data that is not publicly available; doing so violates our submission policy.
  • We do not encourage using the development set of ExpMRC for training (though it is not prohibited). Such submissions will be marked with an asterisk (*).
About the Leaderboard

The leaderboard is updated whenever new submissions are accepted. The ranking is based on the "Overall F1", which jointly considers answer and evidence F1; note that it is not simply the product of the two corpus-level scores. A sketch of the computation is given below; see the paper and the evaluation script for details.
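
To make the note above concrete, here is a sketch assuming an instance-level combination of the two scores (the paper and eval_expmrc.py are authoritative), showing why the corpus Overall F1 differs from the product of the two corpus-level scores.

```python
# Sketch of an instance-level overall score: combine each question's
# answer and evidence scores first, then average. Assumed behavior;
# see the paper and eval_expmrc.py for the authoritative definition.
def overall_f1(answer_scores, evidence_scores):
    pairs = list(zip(answer_scores, evidence_scores))
    return 100.0 * sum(a * e for a, e in pairs) / len(pairs)

# Corpus-level averages here are 0.9 (answer) and 0.5 (evidence), whose
# product would be 45.0, but the instance-level combination gives 42.0.
print(overall_f1([1.0, 0.8], [0.2, 0.8]))  # -> 42.0
```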

Have Questions?

Ask us questions at our GitHub repository or at expmrc [at] 126 [dot] com.

Leaderboard

Explainability is a universal demand across machine reading comprehension tasks. Most MRC systems now reach near-human or super-human performance on answering these datasets, but will your system also surpass humans at giving correct explanations?

SQuAD (EN)

| Rank | Date | Model | Organization | Answer F1 | Evidence F1 | Overall F1 |
|------|------|-------|--------------|-----------|-------------|------------|
| - | - | Human Performance | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 91.3 | 92.9 | 84.7 |
| 🥇 | June 19, 2023 | Anonymized Model (single model) | Anonymized Organization | 92.968 | 90.195 | 84.448 |
| 🥈 | Oct 29, 2022 | T5InterMRC (single model) | Guangxi Normal University | 90.237 | 90.859 | 84.079 |
| 🥉 | June 6, 2023 | Anonymized Model (single model) | Anonymized Organization | 92.968 | 89.626 | 83.973 |
| 4 | May 11, 2021 | BERT-large + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 92.300 | 89.600 | 83.600 |
| 5 | May 23, 2023 | WSQWevi-QA (single model) | Anonymized Organization | 91.074 | 90.195 | 83.489 |
| 6 | May 11, 2021 | BERT-large + MSS (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 92.300 | 85.700 | 80.400 |
| 7 | May 16, 2023 | WSQW-QA (single model) | Anonymized Organization | 91.074 | 86.882 | 80.356 |
| 8 | May 11, 2021 | BERT-base + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 87.100 | 89.100 | 79.600 |
| 9 | May 11, 2021 | BERT-base + MSS (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 87.100 | 85.400 | 76.100 |
CMRC 2018 (ZH)

| Rank | Date | Model | Organization | Answer F1 | Evidence F1 | Overall F1 |
|------|------|-------|--------------|-----------|-------------|------------|
| - | - | Human Performance | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 97.9 | 94.6 | 92.6 |
| 🥇 | Oct 30, 2022 | MT-MacBert+DA (ensemble) | GammaLab | 91.979 | 85.191 | 79.028 |
| 🥈 | May 23, 2023 | WSQWevi-QA (single model) | Anonymized Organization | 92.223 | 80.220 | 74.839 |
| 🥉 | May 16, 2023 | WSQW-QA (single model) | Anonymized Organization | 92.223 | 79.786 | 74.452 |
| 4 | Oct 30, 2022 | MacBERT-large + Pseudo&EScore + DA (ensemble) | Shanxi University | 92.804 | 79.404 | 73.934 |
| 5 | May 11, 2023 | Anonymized Model + DA (single model) | Shanxi University | 90.668 | 79.431 | 72.507 |
| 6 | June 6, 2023 | Anonymized Model (single model) | Anonymized Organization | 92.223 | 77.411 | 72.102 |
| 7 | May 11, 2021 | MacBERT-large + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 88.600 | 70.600 | 63.300 |
| 8 | May 11, 2021 | MacBERT-large + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 88.600 | 71.000 | 63.200 |
| 9 | May 11, 2021 | MacBERT-base + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 84.400 | 69.800 | 59.900 |
| 10 | May 11, 2021 | MacBERT-base + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 84.400 | 69.100 | 59.800 |
RACE+ (EN)

| Rank | Date | Model | Organization | Answer Acc | Evidence F1 | Overall F1 |
|------|------|-------|--------------|------------|-------------|------------|
| - | - | Human Performance | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 93.6 | 90.5 | 84.4 |
| 🥇 | Jul 4, 2023 | SCL4E (single model) | YSU | 75.887 | 58.765 | 50.859 |
| 🥈 | Sep 3, 2021 | BERT-base+EveMRC (single model) | Shanghai Jiao Tong University | 66.667 | 52.542 | 40.739 |
| 🥉 | May 3, 2023 | STEXP+Macbert-base (single model) | YSU | 71.986 | 48.774 | 38.420 |
| 4 | May 11, 2021 | BERT-large + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 68.100 | 42.500 | 31.300 |
| 5 | May 11, 2021 | BERT-large + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 70.400 | 41.300 | 30.800 |
| 6 | May 11, 2021 | BERT-base + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 59.800 | 41.800 | 27.300 |
| 7 | May 11, 2021 | BERT-base + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 60.100 | 43.500 | 27.100 |
C3 (ZH)

| Rank | Date | Model | Organization | Answer Acc | Evidence F1 | Overall F1 |
|------|------|-------|--------------|------------|-------------|------------|
| - | - | Human Performance | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 94.3 | 97.7 | 90.0 |
| 🥇 | Oct 30, 2022 | Ernie3-xbase+OptionExpMRC+DA (ensemble) | Shanxi University | 81.000 | 70.902 | 61.196 |
| 🥈 | Aug 30, 2023 | STEXP (single model) | YSU | 77.400 | 69.492 | 60.429 |
| 🥉 | Oct 30, 2022 | U3Ebase+Macbert-large (single model) | BIT [He et al., 2022] | 75.400 | 69.857 | 57.784 |
| 4 | Jul 18, 2023 | SCL4E (single model) | YSU | 74.600 | 68.712 | 57.138 |
| 5 | May 3, 2023 | STEXP+Macbert-large (single model) | YSU | 75.200 | 68.266 | 56.079 |
| 6 | Jun 23, 2022 | BERT-base+EveMRC (single model) | Shanghai Jiao Tong University | 67.600 | 65.307 | 51.886 |
| 7 | Jun 28, 2023 | Anonymized Model (single model) | Anonymized Organization | 66.600 | 63.584 | 48.826 |
| 8 | May 11, 2021 | MacBERT-large + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 74.400 | 59.900 | 47.300 |
| 9 | Jun 13, 2022 | MacBert-base + ABS2Pseudo-data (single model) | BIT | 65.600 | 60.262 | 46.239 |
| 10 | May 11, 2021 | MacBERT-large + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 72.000 | 58.400 | 46.000 |
| 11 | Jun 20, 2022 | MacBert-base + ABS2Pseudo-data (single model) | BIT | 64.200 | 58.885 | 43.890 |
| 12 | May 11, 2021 | MacBERT-base + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 66.800 | 57.400 | 42.300 |
| 13 | May 11, 2021 | MacBERT-base + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research [Cui et al., 2022] | 69.000 | 57.500 | 40.600 |