The RepEval 2017 Shared Task


RepEval 2017 features a shared task meant to evaluate natural language understanding models based on sentence encoders—that is, models that transform sentences into fixed-length vector representations and reason using those representations. The task will be natural language inference (also known as recognizing textual entailment, or RTE) in the style of SNLI—a three-class balanced classification problem over sentence pairs. The shared task will feature a new, dedicated dataset that spans several genres of text. Participation is open and new teams may join at any time.

The shared task will feature two evaluations: a standard in-domain evaluation, in which the training and test data are drawn from the same sources, and a cross-domain evaluation, in which the training and test data differ substantially. The cross-domain evaluation will test the ability of submitted systems to learn representations of sentence meaning that capture broadly useful features.

The data

A preliminary version of the training and development sections of the task data can be found here.

The task dataset (called the Multi-Genre NLI Corpus, or MultiNLI) consists of 393k training examples drawn from five genres of text, and 40k test and development examples drawn from those same five genres, as well as five more. Data collection for the task dataset is closely modeled on that of SNLI, which is based on the genre of image captions. SNLI may be used as additional training and development data, but will not be included in the evaluation.

Section name               Training pairs   Dev pairs   Test pairs
Captions (SNLI Corpus)     (550,152)        (10,000)    (10,000)
Fiction                    77,348           2,000       2,000
Government                 77,350           2,000       2,000
Slate                      77,306           2,000       2,000
Telephone Speech           83,348           2,000       2,000
Travel Guides              77,350           2,000       2,000
9/11 Report                0                2,000       2,000
Face-to-face Speech        0                2,000       2,000
Letters                    0                2,000       2,000
Nonfiction Books (OUP)     0                2,000       2,000
Magazine (Verbatim)        0                2,000       2,000
Total                      392,702          20,000      20,000

As in SNLI, each example will consist of two sentences and a label. The first sentence is drawn from a preexisting text source—either one of the sections of the Open American National Corpus (OpenANC) or some other permissively licensed source. The second sentence is written by crowd workers as part of data collection. Data for each genre will be collected in a separate crowdsourcing task. The labels will be entailment, neutral, and contradiction, in roughly equal proportions. Some examples from the corpus can be seen below.

Premise | Label | Hypothesis
The Old One always comforted Ca'daan, except today. | neutral | Ca'daan knew the Old One very well.
Your gift is appreciated by each and every student who will benefit from your generosity. | neutral | Hundreds of students will benefit from your generosity.
Telephone Speech
yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or | contradiction | August is a black out month for vacations in the company.
9/11 Report
At the other end of Pennsylvania Avenue, people began to line up for a White House tour. | entailment | People formed a line at the end of Pennsylvania Avenue.

Data is available in the same two formats as SNLI: tab-separated values and jsonl. It will take the form of five files in each format: train, in-domain development, cross-domain development, in-domain test, and cross-domain test. Each individual example will be marked with a genre tag.
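As a rough illustration of working with the jsonl format and the per-example genre tag, the sketch below reads one JSON object per line and groups examples by genre. It assumes the field names follow the SNLI jsonl release (`sentence1`, `sentence2`, `gold_label`) plus a `genre` field. The helper names and the genre tag strings are hypothetical, and an in-memory sample stands in for a real file.

```python
import io
import json
from collections import Counter, defaultdict


def read_examples(lines):
    """Yield one dict per non-empty line of jsonl input."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def group_by_genre(examples):
    """Map each genre tag to the list of examples that carry it."""
    groups = defaultdict(list)
    for ex in examples:
        groups[ex["genre"]].append(ex)
    return dict(groups)


# Two records in the ASSUMED schema; the genre strings are made up.
SAMPLE = io.StringIO(
    json.dumps({
        "genre": "telephone",
        "gold_label": "contradiction",
        "sentence1": "we can dress a little more casual",
        "sentence2": "August is a black out month for vacations in the company.",
    }) + "\n" +
    json.dumps({
        "genre": "nineeleven",
        "gold_label": "entailment",
        "sentence1": "At the other end of Pennsylvania Avenue, people began "
                     "to line up for a White House tour.",
        "sentence2": "People formed a line at the end of Pennsylvania Avenue.",
    }) + "\n"
)

groups = group_by_genre(read_examples(SAMPLE))

# Label distribution per genre, useful for checking class balance.
label_counts = {
    genre: Counter(ex["gold_label"] for ex in exs)
    for genre, exs in groups.items()
}
```

With a real file, `read_examples(open(path, encoding="utf-8"))` would take the place of the in-memory sample.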

We are also separately distributing a small subset of the development data that has been manually annotated to facilitate error analysis.

Rules and evaluation


  • Evaluation will be done using the Kaggle platform. During the evaluation period, submitters download an unlabeled copy of the test set, use their systems to predict labels, and upload those labels. Standard Kaggle rules apply.
  • Submitters may submit to either or both of the two evaluations (in-domain or cross-domain).
  • Multiple submissions from the same team are allowed, up to a limit of two per day during the two-week evaluation period. Individual participants (i.e., PIs) may join multiple teams within reason, but only when each team reflects a fully independent engineering effort, and each team has a different lead developer. (Note: Kaggle may not allow entrants to formally join multiple teams. In this case, PIs and others spanning multiple teams should join at most one Kaggle team and disclose their team memberships only in their paper submission.)


  • This competition is meant to evaluate the quality of vector representations of sentences, and all submitted systems should include a bottleneck in which sentences are represented as fixed-length vectors with no explicitly-imposed internal structure. Typical attention and memory models that represent sentences as sets or sequences of vectors, though useful for tasks like NLI, are not eligible for inclusion in this competition. (It is allowed, should it be useful, to use two separate models to encode the two input sentences.)
  • The development sets are to be used for model selection and the tuning of reasonable hyperparameters. Models that are explicitly trained on the development data may be disqualified.
  • Models should make their predictions for each example independently. Different pairs that share a premise typically have different labels (an artifact of how we created the data), but systems are not allowed to exploit this signal to model joint distributions over multiple examples at once. (If you aren’t sure whether this applies to your system, it probably doesn’t.)
  • For inclusion in the workshop and the final leaderboard, submitters will have to upload a code package that can be used to reproduce the submitted results. This code package will not be used as the primary means of evaluation, but it will be made public to encourage both reproducibility and future extension of submitted models. Moreover, for a limited set of sentences, participants will be asked to provide the sentence vectors produced by their best-performing model. These vectors will be used in our own analysis of the results.
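The fixed-length bottleneck described in the rules above can be pictured with a minimal sketch. This is not a reference implementation: the toy word vectors, the averaging encoder, the `[u, v, |u - v|, u * v]` feature combination, and the untrained linear classifier are all assumptions chosen for illustration. What matters is that each sentence is encoded on its own into a single vector, and the classifier sees only those two vectors.

```python
# Toy 3-dimensional word vectors, made up for illustration only.
EMBEDDINGS = {
    "people": [0.2, 0.1, 0.0],
    "formed": [0.0, 0.3, 0.1],
    "a": [0.1, 0.0, 0.0],
    "line": [0.4, 0.2, 0.3],
    "began": [0.0, 0.1, 0.2],
}
DIM = 3
LABELS = ["entailment", "neutral", "contradiction"]


def encode(sentence):
    """Encode ONE sentence as a fixed-length vector (mean of word vectors).

    The encoder never sees the other sentence of the pair, so there is
    no attention between sentences.
    """
    vecs = [EMBEDDINGS[t] for t in sentence.lower().split() if t in EMBEDDINGS]
    if not vecs:
        return [0.0] * DIM
    return [sum(column) / len(vecs) for column in zip(*vecs)]


def pair_features(u, v):
    """Combine two sentence vectors into one feature vector of size 4 * DIM."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return u + v + diff + prod


def classify(premise, hypothesis, weights, bias):
    """Score each label with a linear layer over the pair features."""
    feats = pair_features(encode(premise), encode(hypothesis))
    scores = [
        sum(w * f for w, f in zip(row, feats)) + b
        for row, b in zip(weights, bias)
    ]
    return LABELS[scores.index(max(scores))]


# Untrained placeholder parameters: zero weights and a bias favoring "neutral".
weights = [[0.0] * (4 * DIM) for _ in LABELS]
bias = [0.0, 0.1, 0.0]
prediction = classify("people began a line", "people formed a line", weights, bias)
```

Because `classify` handles one pair at a time, this sketch also makes each prediction independently of all other examples.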

Outside data

  • The use of outside data is allowed. Raw unannotated text from any source, word vector packages, and knowledge resources like WordNet are explicitly permitted. We will provide links to unlabeled OpenANC data that reflects the target genres.
  • All outside data used must be publicly available to allow for reproducibility. Widely-used data with restrictive licenses or licensing fees (such as LDC-distributed corpora) may be allowed at our discretion. Please inquire at the QA forum below.


Baselines

Model                  Matched Test Acc.   Mismatched Test Acc.
Most Frequent Class    36.5                35.6
CBOW                   65.2                64.6
BiLSTM                 67.5                67.1

  • Note that the paper also presents results with an ESIM. That model relies on attention between sentences and would be ineligible for inclusion in this competition.

  • Both models are trained on a mix of MultiNLI and SNLI and use GloVe word vectors.

  • Code (TensorFlow/Python) is available here.
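For orientation, the accuracy figures and the most-frequent-class row in the table above can be computed along the following lines. This is a minimal sketch with made-up labels, not the organizers' evaluation code, and the function names are hypothetical.

```python
from collections import Counter


def accuracy(gold, predicted):
    """Percentage of examples whose predicted label equals the gold label."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equal in length")
    correct = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * correct / len(gold)


def most_frequent_class(train_labels):
    """Return the label that occurs most often in the training data."""
    return Counter(train_labels).most_common(1)[0][0]


# Made-up labels, for illustration only.
train_labels = ["entailment", "entailment", "neutral", "contradiction"]
gold = ["entailment", "neutral", "contradiction", "entailment"]

# The baseline predicts the same label for every test example.
majority = most_frequent_class(train_labels)
baseline_predictions = [majority] * len(gold)
baseline_accuracy = accuracy(gold, baseline_predictions)
```

On a balanced three-class test set such a baseline lands near one third, which is consistent with the 36.5 and 35.6 figures reported above.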

Paper submission

For inclusion in the workshop and the final leaderboard, you must submit:

  • A system description paper of 2–4 pages in ACL format. System description papers will be reviewed for readability and soundness (but not novelty/technical merit) before acceptance.
  • A .zip code package that can be used to reproduce the submitted results after being trained on widely-available data files.

Key dates

  • March 24: Training and development data and draft data description paper available, competition begins
  • By May 15: Expert-tagged development data for error analysis available
  • June 1: Unlabeled test data available, evaluation period begins, Kaggle evaluation site opens
  • June 14 (GMT-11, 23:59:59): Evaluation period ends, system description papers and code packages due
  • June 16: Winners formally announced
  • July 3 (GMT-11, 23:59:59): Reviews due
  • July 6: Notification of presentation acceptance
  • July 21 (GMT-11, 23:59:59): Camera-ready papers due
  • September 8: Workshop at EMNLP 2017, Copenhagen: shared task poster session and selected short talks

Join us!