ExpMRC

Explainability Evaluation for Machine Reading Comprehension

What is ExpMRC?

ExpMRC is a benchmark for the explainability evaluation of Machine Reading Comprehension (MRC). It contains four subsets of popular MRC datasets with additionally annotated evidence: SQuAD, CMRC 2018, RACE+ (similar to RACE), and C3, covering both span-extraction and multiple-choice MRC tasks in English and Chinese.

For details, see the ExpMRC paper [Cui et al., 2021] (https://arxiv.org/abs/2105.04126).
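To give a feel for the data, here is a hypothetical SQuAD-style record. The field names (including evidences) are assumptions for illustration only; consult the released files for the exact schema.

```python
# Hypothetical ExpMRC-style record (all field names assumed, not official):
# a span-extraction question annotated with both its answer and the
# passage sentence that supports it.
example = {
    "context": "Tesla was born on 10 July 1856 in Smiljan. He later moved ...",
    "qas": [
        {
            "id": "expmrc-squad-0001",  # made-up identifier
            "question": "When was Tesla born?",
            "answers": ["10 July 1856"],
            "evidences": ["Tesla was born on 10 July 1856 in Smiljan."],
        }
    ],
}
```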

Getting Started

Download a copy of the dataset (distributed under the CC BY-SA 4.0 license) from our GitHub repository.

To evaluate your models, we also provide the official evaluation script, along with sample predictions for each subset. To run the evaluation, use: python eval_expmrc.py <path_to_dev> <path_to_predictions>
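The authoritative prediction schema is whatever eval_expmrc.py and the sample predictions define; the sketch below assumes a JSON object mapping each question ID to a predicted answer and a predicted evidence sentence.

```python
import json

# Assumed prediction layout (verify against the sample predictions shipped
# with the evaluation script): question ID -> predicted answer + evidence.
predictions = {
    "expmrc-squad-0001": {  # made-up question ID
        "answer": "10 July 1856",
        "evidence": "Tesla was born on 10 July 1856 in Smiljan.",
    },
}

with open("predictions.json", "w", encoding="utf-8") as f:
    json.dump(predictions, f, ensure_ascii=False)

# Score against the development set (file name is illustrative):
#   python eval_expmrc.py expmrc-squad-dev.json predictions.json
```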

You may also be interested in a quick baseline system based on a pre-trained language model such as BERT; a rough sketch of one is shown below. The code is distributed under the Apache-2.0 license.
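As a rough illustration of what a span-extraction baseline with an evidence heuristic might look like (this is not the official baseline; the model name and the sentence-containing-the-answer heuristic are our assumptions):

```python
import re
from transformers import pipeline

# Off-the-shelf extractive QA model (assumption: any SQuAD-finetuned
# BERT checkpoint would do here).
qa = pipeline(
    "question-answering",
    model="bert-large-uncased-whole-word-masking-finetuned-squad",
)

def predict(question: str, context: str) -> dict:
    """Predict an answer span, then take the sentence containing it as evidence.

    The evidence heuristic is ours, for illustration; the official baselines
    (e.g. MSS, pseudo-data) are described in the ExpMRC paper.
    """
    result = qa(question=question, context=context)
    # Naive sentence split; a real system would use a proper sentence tokenizer.
    sentences = re.split(r"(?<=[.!?])\s+", context)
    evidence = next(
        (s for s in sentences if result["answer"] in s),
        sentences[0],
    )
    return {"answer": result["answer"], "evidence": evidence}
```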

Official Submission

To preserve the integrity of test results, we do not release the test sets to the public. Instead, we require you to upload your model to CodaLab so that we can run it on the test sets for you. You can follow the instructions on CodaLab (the process is similar to the SQuAD and CMRC 2018 submissions); see the Submission Tutorial.

Submission Policy (IMPORTANT!)

Please read the following before submission. We also strongly suggest that participants follow our baseline settings to ensure a fair comparison for academic purposes.

  • You are free to use any open-source machine reading comprehension data or automatically generated data (both labeled and unlabeled) for training your systems.
  • You are NOT allowed to train on any newly human-annotated data that is not publicly available; doing so violates our submission policy.
  • We do not encourage using the development set of ExpMRC for training. Such submissions will be marked with an asterisk (*).
About the Leaderboard

The leaderboard is updated whenever new submissions are received. The ranking is based on "Overall F1", which takes both the answer and evidence F1 into account (it is not simply the product of the two; see the sketch below). The details of the evaluation process can be found in the evaluation script.
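To make that distinction concrete, here is a minimal sketch assuming Overall F1 averages the answer-times-evidence score per instance rather than multiplying the two corpus-level averages; the authoritative definition lives in eval_expmrc.py.

```python
def overall_f1(instances):
    """instances: list of (answer_score, evidence_f1) pairs, each in [0, 1].

    Assumption for illustration: combine the two scores per instance,
    then average over the dataset.
    """
    return sum(a * e for a, e in instances) / len(instances)

# Why this differs from a plain product of averages: a system that answers
# half the questions perfectly but explains only the *other* half would get
# overall_f1 = 0.0 here, while the product of the two corpus-level averages
# (0.5 * 0.5 = 0.25) would overstate it.
print(overall_f1([(1.0, 0.0), (0.0, 1.0)]))  # -> 0.0
```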

Have Questions?

Ask us questions at our GitHub repository or via email at expmrc [at] 126 [dot] com.

Leaderboard

Explainability is a universal demand across machine reading comprehension tasks. Most MRC systems now reach near-human or super-human performance on answering these datasets, but will your system also surpass humans at giving correct explanations?

SQuAD (EN)

| Rank | Date | Model | Team | Answer F1 | Evidence F1 | Overall F1 |
|------|------|-------|------|-----------|-------------|------------|
| – | – | Human Performance [Cui et al., 2021] | Joint Laboratory of HIT and iFLYTEK Research | 91.3 | 92.9 | 84.7 |
| 1 | May 11, 2021 | BERT-large + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 92.300 | 89.600 | 83.600 |
| 2 | May 11, 2021 | BERT-large + MSS (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 92.300 | 85.700 | 80.400 |
| 3 | May 11, 2021 | BERT-base + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 87.100 | 89.100 | 79.600 |
| 4 | May 11, 2021 | BERT-base + MSS (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 87.100 | 85.400 | 76.100 |
CMRC 2018 (ZH)

| Rank | Date | Model | Team | Answer F1 | Evidence F1 | Overall F1 |
|------|------|-------|------|-----------|-------------|------------|
| – | – | Human Performance [Cui et al., 2021] | Joint Laboratory of HIT and iFLYTEK Research | 97.9 | 94.6 | 92.6 |
| 1 | May 11, 2021 | MacBERT-large + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 88.600 | 70.600 | 63.300 |
| 2 | May 11, 2021 | MacBERT-large + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 88.600 | 71.000 | 63.200 |
| 3 | May 11, 2021 | MacBERT-base + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 84.400 | 69.800 | 59.900 |
| 4 | May 11, 2021 | MacBERT-base + PA Sent. (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 84.400 | 69.100 | 59.800 |
RACE+ (EN)

| Rank | Date | Model | Team | Answer Acc | Evidence F1 | Overall F1 |
|------|------|-------|------|------------|-------------|------------|
| – | – | Human Performance [Cui et al., 2021] | Joint Laboratory of HIT and iFLYTEK Research | 93.6 | 90.5 | 84.4 |
| 1 | Sep 3, 2021 | BERT-base + EveMRC (single model) | Shanghai Jiao Tong University | 66.667 | 52.542 | 40.739 |
| 2 | May 11, 2021 | BERT-large + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 68.100 | 42.500 | 31.300 |
| 3 | May 11, 2021 | BERT-large + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 70.400 | 41.300 | 30.800 |
| 4 | May 11, 2021 | BERT-base + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 59.800 | 41.800 | 27.300 |
| 5 | May 11, 2021 | BERT-base + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 60.100 | 43.500 | 27.100 |
C3 (ZH)

| Rank | Date | Model | Team | Answer Acc | Evidence F1 | Overall F1 |
|------|------|-------|------|------------|-------------|------------|
| – | – | Human Performance [Cui et al., 2021] | Joint Laboratory of HIT and iFLYTEK Research | 94.3 | 97.7 | 90.0 |
| 1 | May 11, 2021 | MacBERT-large + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 74.400 | 59.900 | 47.300 |
| 2 | May 11, 2021 | MacBERT-large + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 72.000 | 58.400 | 46.000 |
| 3 | May 11, 2021 | MacBERT-base + MSS w/ Ques. (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 66.800 | 57.400 | 42.300 |
| 4 | May 11, 2021 | MacBERT-base + Pseudo-data (single model) | Joint Laboratory of HIT and iFLYTEK Research (https://arxiv.org/abs/2105.04126) | 69.000 | 57.500 | 40.600 |