The RepEval 2017 Shared Task

Introduction

RepEval 2017 features a shared task meant to evaluate natural language understanding models based on sentence encoders—that is, models that transform sentences into fixed-length vector representations and reason using those representations. The task will be natural language inference (also known as recognizing textual entailment, or RTE) in the style of SNLI—a three-class balanced classification problem over sentence pairs. The shared task will feature a new, dedicated dataset that spans several genres of text. Participation is open and new teams may join at any time.

The shared task includes two evaluations, a standard in-domain (matched) evaluation in which the training and test data are drawn from the same sources, and a cross-domain (mismatched) evaluation in which the training and test data differ substantially. This cross-domain evaluation will test the ability of submitted systems to learn representations of sentence meaning that capture broadly useful features.
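To make the sentence-encoder setup described in this introduction concrete, here is a minimal sketch of a model with a fixed-length bottleneck. It is an illustration under stated assumptions, not the method of any submitted system: the toy embedding table, the averaging encoder, and the function names `encode` and `pair_features` are all hypothetical. The feature combination (both vectors, their absolute difference, and their element-wise product) is a common choice for NLI and is assumed here for illustration.

```python
from typing import Dict, List

Vector = List[float]


def encode(sentence: str, embeddings: Dict[str, Vector], dim: int) -> Vector:
    """Average the known word vectors into one fixed-length sentence vector."""
    tokens = sentence.lower().split()
    known = [embeddings[t] for t in tokens if t in embeddings]
    if not known:
        # No known words: fall back to an all-zero vector of the same length.
        return [0.0] * dim
    return [sum(vec[i] for vec in known) / len(known) for i in range(dim)]


def pair_features(u: Vector, v: Vector) -> Vector:
    """Build classifier input from the two sentence vectors only.

    The classifier never sees the tokens, so all information about each
    sentence has to pass through its fixed-length vector.
    """
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return u + v + diff + prod


# Hypothetical two-dimensional embeddings, for illustration only.
toy_embeddings = {
    "people": [1.0, 0.0],
    "line": [0.0, 1.0],
    "formed": [1.0, 1.0],
}

u = encode("People formed a line", toy_embeddings, dim=2)
v = encode("people line up", toy_embeddings, dim=2)
features = pair_features(u, v)
```

Whatever the sentence length, `u` and `v` have the same dimensionality, and `features` is four times that size. A three-way classifier over entailment, neutral, and contradiction would then be trained on `features`.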

Results

The following table presents the results as of the close of the competition on June 15. All numbers reflect accuracy on the hidden portion of the test set. Accepted papers will be made available in late summer.

Team                     Authors                             Matched Acc.  Mismatched Acc.
alpha (ensemble)         Qian Chen et al. (arXiv)            74.9%         74.9%
YixinNie-UNC-NLP         Yixin Nie and Mohit Bansal (arXiv)  74.5%         73.5%
alpha                    Qian Chen et al. (arXiv)            73.5%         73.6%
Rivercorners (ensemble)  Jorge Balazs et al. (arXiv)         72.2%         72.8%
Rivercorners             Jorge Balazs et al. (arXiv)         72.1%         72.1%
LCT-MALTA                Hoa Vu et al.                       70.7%         70.8%
TALP-UPC                 Han Yang et al. (arXiv)             67.9%         68.2%
BiLSTM baseline          Williams et al.                     67.0%         67.6%

Recap Paper

Competition Recap Paper (preprint)

The data

The training and development sections of the task data can be found here. The unlabeled test data is available through the Kaggle in Class competition pages (matched, mismatched).

The task dataset (called the Multi-Genre NLI Corpus, or MultiNLI) consists of 393k training examples drawn from five genres of text, and 40k test and development examples drawn from those same five genres, as well as five more. Data collection for the task dataset is closely modeled on SNLI, which is based on the genre of image captions. SNLI may be used as additional training and development data, but will not be included in the evaluation.

Section name            Training pairs  Dev pairs  Test pairs
Captions (SNLI Corpus)  (550,152)       (10,000)   (10,000)
Fiction                 77,348          2,000      2,000
Government              77,350          2,000      2,000
Slate                   77,306          2,000      2,000
Telephone Speech        83,348          2,000      2,000
Travel Guides           77,350          2,000      2,000
9/11 Report             0               2,000      2,000
Face-to-face Speech     0               2,000      2,000
Letters                 0               2,000      2,000
Nonfiction Books (OUP)  0               2,000      2,000
Magazine (Verbatim)     0               2,000      2,000
Total                   392,702         20,000     20,000

As in SNLI, each example will consist of two sentences and a label. The first sentence is drawn from a preexisting text source—either one of the sections of the Open American National Corpus (OpenANC) or some other permissively licensed source. The second sentence is written by crowd workers as part of data collection. Data for each genre will be collected in a separate crowdsourcing task. The labels will be entailment, neutral, and contradiction, in roughly equal proportions. Some examples from the corpus can be seen below.

Fiction
  Premise: The Old One always comforted Ca'daan, except today.
  Label: neutral
  Hypothesis: Ca'daan knew the Old One very well.
Letters
  Premise: Your gift is appreciated by each and every student who will benefit from your generosity.
  Label: neutral
  Hypothesis: Hundreds of students will benefit from your generosity.
Telephone Speech
  Premise: yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or
  Label: contradiction
  Hypothesis: August is a black out month for vacations in the company.
9/11 Report
  Premise: At the other end of Pennsylvania Avenue, people began to line up for a White House tour.
  Label: entailment
  Hypothesis: People formed a line at the end of Pennsylvania Avenue.

Data is available in the same two formats as SNLI: tab-separated values and jsonl. It will take the form of five files in each format: train, in-domain development, cross-domain development, in-domain test, and cross-domain test. Each individual example will be marked with a genre tag.
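As a rough illustration of working with the jsonl format, the sketch below parses one JSON object per line and counts examples per genre tag. The field names (`pairID`, `genre`, `gold_label`, `sentence1`, `sentence2`) and the genre tag strings are assumptions modeled on SNLI-style conventions and should be checked against the distributed files. The two sample records reuse examples from the corpus shown above.

```python
import json
from collections import Counter
from typing import Dict, Iterable, Iterator


def read_jsonl(lines: Iterable[str]) -> Iterator[Dict]:
    """Yield one example per non-empty line, each line being a JSON object."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def count_by_genre(examples: Iterable[Dict], genre_key: str = "genre") -> Counter:
    """Count examples per genre tag (the key name is an assumption)."""
    return Counter(example[genre_key] for example in examples)


# Two in-memory records standing in for lines of a jsonl file.
# Field names and genre tag values are assumed, not taken from the release.
sample_lines = [
    json.dumps({
        "pairID": "1",
        "genre": "fiction",
        "gold_label": "neutral",
        "sentence1": "The Old One always comforted Ca'daan, except today.",
        "sentence2": "Ca'daan knew the Old One very well.",
    }),
    json.dumps({
        "pairID": "2",
        "genre": "9/11 report",
        "gold_label": "entailment",
        "sentence1": "At the other end of Pennsylvania Avenue, people began "
                     "to line up for a White House tour.",
        "sentence2": "People formed a line at the end of Pennsylvania Avenue.",
    }),
]

examples = list(read_jsonl(sample_lines))
genre_counts = count_by_genre(examples)
```

With a real file, the same `read_jsonl` function can be given an open file handle in place of `sample_lines`, since a text file iterates line by line.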

We are also separately distributing a small subset of the development data that has been manually annotated to facilitate error analysis.

Rules and evaluation

Evaluation

  • Evaluation will be done using the Kaggle platform. During the evaluation period, participants download an unlabeled copy of the test set, use their systems to predict labels, and upload those labels. Standard Kaggle rules apply.
  • Participants may submit to either or both of the two evaluations (in-domain or cross-domain).
  • Multiple submissions from the same team are allowed, up to a limit of two per day during the two-week evaluation period. Individual participants (i.e., PIs) may join multiple teams within reason, but only when each team reflects a fully independent engineering effort, and each team has a different lead developer. (Note: Kaggle may not allow entrants to formally join multiple teams. In this case, PIs and others spanning multiple teams should join at most a single Kaggle team and only disclose their memberships in their paper submission.)

Systems

  • This competition is meant to evaluate the quality of vector representations of sentences, and all submitted systems should include a bottleneck in which sentences are represented as fixed-length vectors with no explicitly-imposed internal structure. Typical attention and memory models that represent sentences as sets or sequences of vectors, though useful for tasks like NLI, are not eligible for inclusion in this competition. (It is allowed, should it be useful, to use two separate models to encode the two input sentences.)
  • The development sets are to be used for model selection and the tuning of reasonable hyperparameters. Models that are explicitly trained on the development data may be disqualified.
  • Models should make their predictions for each example independently. It is the case that different pairs sharing the same premise typically have different labels (as an artifact of how we created the data), but systems are not allowed to exploit this signal to model joint distributions over multiple examples at once. (If you aren’t sure whether this applies to your system, it probably doesn’t.)
  • For inclusion in the workshop and the final leaderboard, participants will have to upload a code package that can be used to reproduce the submitted results. This code package will not be used as the primary means of evaluation, but it will be made public to encourage both reproducibility and future extension of submitted models.
  • For inclusion in the workshop and the final leaderboard, participants will also be asked to provide the sentence vectors produced by their best performing model. When you submit your system, you will also be asked to upload a link to a sentence vector file with lines of the form ‘pairID \t p or h (for premise or hypothesis) \t sentence representation vector’, where the vector is given as whitespace-delimited (tab or space) values. For example, a line might look like:
   123	p	0.27204 -0.06203 -0.1884 0.023225 -0.018158 0.0067192 -0.13877 0.17708 0.17709 ...

You should supply vectors for every sentence (premise and hypothesis) in the test set(s) for the competition(s) you’re submitting to. In addition, you are also asked to submit vectors for a set of additional probe sentences. These sentences are here, distributed in the same format as MultiNLI, but with one sentence per line, marked as the premise, and no hypothesis. All your vectors should be concatenated into a single file.
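The line format for the sentence vector file can be sketched as follows. The helper names `format_vector_line` and `parse_vector_line` are hypothetical, and using a single space between vector values is just one of the whitespace delimiters the format permits. Probe sentences are written with the `p` marker, since they are distributed as premises.

```python
from typing import List, Sequence, Tuple


def format_vector_line(pair_id: str, side: str, vector: Sequence[float]) -> str:
    """Return 'pairID<TAB>p-or-h<TAB>v1 v2 ...' for one sentence vector."""
    if side not in ("p", "h"):
        raise ValueError("side must be 'p' (premise) or 'h' (hypothesis)")
    values = " ".join(repr(float(x)) for x in vector)
    return f"{pair_id}\t{side}\t{values}"


def parse_vector_line(line: str) -> Tuple[str, str, List[float]]:
    """Inverse of format_vector_line; accepts tab- or space-separated values."""
    pair_id, side, values = line.rstrip("\n").split("\t", 2)
    return pair_id, side, [float(x) for x in values.split()]


premise_line = format_vector_line("123", "p", [0.27204, -0.06203, -0.1884])
hypothesis_line = format_vector_line("123", "h", [0.023225, -0.018158, 0.0067192])

# All lines (test sentences and probe sentences) go into one file body.
file_body = "\n".join([premise_line, hypothesis_line]) + "\n"
```

Parsing each written line back with `parse_vector_line` is a cheap sanity check that every vector in the file has the same length before uploading.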

Outside data

  • The use of outside data is allowed, including raw unannotated text from any source (including OpenANC and the other sources from which MultiNLI was derived), word vector packages, and knowledge resources like WordNet.
  • All outside data used must be publicly available to allow for reproducibility. Widely-used data with restrictive licenses or licensing fees (such as LDC-distributed corpora) may be allowed at our discretion. Please inquire at the QA forum below.

Baselines

Model                Matched Test Acc.  Mismatched Test Acc.
Most Frequent Class  36.5               35.6
CBOW                 65.2               64.6
BiLSTM               67.5               67.1

  • Note that the paper also presents results with an ESIM. That model relies on attention between sentences and would be ineligible for inclusion in this competition.

  • Both the CBOW and BiLSTM models are trained on a mix of MultiNLI and SNLI and use GloVe word vectors.

  • Code (TensorFlow/Python) is available here.
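For intuition about the Most Frequent Class row in the table above, the sketch below shows one way such a baseline could be computed: predict the single most common training label for every test example and measure accuracy. This is an assumed reading of how that row is defined, and the tiny label lists are invented for illustration. Because the three labels occur in roughly equal proportions, a baseline of this kind lands a little above one third.

```python
from collections import Counter
from typing import List, Sequence


def most_frequent_label(train_labels: Sequence[str]) -> str:
    """Return the label that occurs most often in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


def accuracy(predictions: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    if len(predictions) != len(gold) or not gold:
        raise ValueError("predictions and gold must be non-empty and equal in length")
    return sum(p == g for p, g in zip(predictions, gold)) / len(gold)


# Invented toy labels, not drawn from MultiNLI.
train_labels: List[str] = ["entailment", "neutral", "entailment", "contradiction"]
test_labels: List[str] = ["entailment", "neutral", "contradiction", "entailment"]

majority = most_frequent_label(train_labels)
baseline_accuracy = accuracy([majority] * len(test_labels), test_labels)
```

The same `accuracy` function applies to any system's predictions on the development sets, scored separately for the matched and mismatched evaluations.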

Leaderboard and evaluation site

To participate in the competition, evaluate your system, or view the current mid-competition leaderboard (including systems that may not qualify for the final leaderboard), use the two Kaggle in Class competitions (matched and mismatched).

Paper submission

For inclusion in the workshop and the final leaderboard, you must submit:

  • A system description paper of 2–4 pages in EMNLP format. System description papers will be reviewed for readability and soundness (but not novelty/technical merit) before acceptance.
  • A .zip code package (as a link from your paper) that can be used to reproduce the submitted results after being trained on widely-available data files.
  • A URL for a vector package, as discussed above.

Paper preparation and uploading instructions can be found in the Call for Papers.

Key dates

Join us!