RepEval2017

The RepEval 2017 Shared Task

Introduction

RepEval 2017 features a shared task meant to evaluate natural language understanding models based on sentence encoders—that is, models that transform sentences into fixed-length vector representations and reason using those representations. The task will be natural language inference (also known as recognizing textual entailment, or RTE) in the style of SNLI—a three-class balanced classification problem over sentence pairs. The shared task will feature a new, dedicated dataset that spans several genres of text. Participation is open and new teams may join at any time.

The shared task includes two evaluations: a standard in-domain (matched) evaluation, in which the training and test data are drawn from the same sources, and a cross-domain (mismatched) evaluation, in which the training and test data differ substantially. The cross-domain evaluation will test the ability of submitted systems to learn representations of sentence meaning that capture broadly useful features.

Results

The following table presents the results as of the close of the competition on June 15. All numbers reflect accuracy on the hidden portion of the test set.

Team	Authors	Matched Acc.	Mismatched Acc.
alpha (ensemble)	Qian Chen et al.	74.9%	74.9%
YixinNie-UNC-NLP	Yixin Nie and Mohit Bansal	74.5%	73.5%
alpha	Qian Chen et al.	73.5%	73.6%
Rivercorners (ensemble)	Jorge Balazs et al.	72.2%	72.8%
Rivercorners	Jorge Balazs et al.	72.1%	72.1%
LCT-MALTA	Hoa Vu et al.	70.7%	70.8%
TALP-UPC	Han Yang et al.	67.9%	68.2%
BiLSTM baseline	Williams et al.	67.0%	67.6%

Recap Paper

Competition Recap Paper

The data

The training and development sections of the task data can be found here. The unlabeled test data is available through the Kaggle in Class competition pages (matched, mismatched).

The task dataset (called the Multi-Genre NLI Corpus, or MultiNLI) consists of 393k training examples drawn from five genres of text, and 40k test and development examples drawn from those same five genres, as well as five more. Data collection for the task dataset is closely modeled on that of SNLI. SNLI, which is based on the genre of image captions, may be used as additional training and development data, but will not be included in the evaluation.

Section name	Training pairs	Dev pairs	Test pairs
Captions (SNLI Corpus)	(550,152)	(10,000)	(10,000)
Fiction	77,348	2,000	2,000
Government	77,350	2,000	2,000
Slate	77,306	2,000	2,000
Telephone Speech	83,348	2,000	2,000
Travel Guides	77,350	2,000	2,000
9/11 Report	0	2,000	2,000
Face-to-face Speech	0	2,000	2,000
Letters	0	2,000	2,000
Nonfiction Books (OUP)	0	2,000	2,000
Magazine (Verbatim)	0	2,000	2,000
Total	392,702	20,000	20,000

As in SNLI, each example will consist of two sentences and a label. The first sentence is drawn from a preexisting text source—either one of the sections of the Open American National Corpus (OpenANC) or some other permissively licensed source. The second sentence is written by crowd workers as part of data collection. Data for each genre will be collected in a separate crowdsourcing task. The labels will be entailment, neutral, and contradiction, in roughly equal proportions. Some examples from the corpus can be seen below.

Premise	Label	Hypothesis
Fiction
The Old One always comforted Ca'daan, except today.	neutral	Ca'daan knew the Old One very well.
Letters
Your gift is appreciated by each and every student who will benefit from your generosity.	neutral	Hundreds of students will benefit from your generosity.
Telephone Speech
yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or	contradiction	August is a black out month for vacations in the company.
9/11 Report
At the other end of Pennsylvania Avenue, people began to line up for a White House tour.	entailment	People formed a line at the end of Pennsylvania Avenue.

Data is available in the same two formats as SNLI: tab-separated values and jsonl. It will take the form of five files in each format: train, in-domain development, cross-domain development, in-domain test, and cross-domain test. Each individual example will be marked with a genre tag.
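As a rough illustration of working with the jsonl format, the sketch below parses a few lines and counts examples by genre tag and by label. The field names (pairID, genre, gold_label, sentence1, sentence2) and the genre tag strings are assumptions made for this example, as are the pair IDs; check them against the distributed files before relying on them.

```python
import json
from collections import Counter

# Two records standing in for lines of a MultiNLI .jsonl file.
# ASSUMPTION: the field names and genre tag values below are illustrative
# and may not match the distributed files exactly.
SAMPLE_JSONL = """\
{"pairID": "1n", "genre": "fiction", "gold_label": "neutral", "sentence1": "The Old One always comforted Ca'daan, except today.", "sentence2": "Ca'daan knew the Old One very well."}
{"pairID": "2e", "genre": "nineeleven", "gold_label": "entailment", "sentence1": "At the other end of Pennsylvania Avenue, people began to line up for a White House tour.", "sentence2": "People formed a line at the end of Pennsylvania Avenue."}
"""


def read_examples(lines):
    """Parse an iterable of JSON lines into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def count_by(examples, key):
    """Count how many examples carry each value of the given field."""
    return Counter(example[key] for example in examples)


examples = read_examples(SAMPLE_JSONL.splitlines())
genre_counts = count_by(examples, "genre")
label_counts = count_by(examples, "gold_label")
```

With a real file, the same functions could be applied to an open file handle instead of the in-memory sample, since iterating over a file yields one line at a time.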

We are also separately distributing a small subset of the development data that has been manually annotated to facilitate error analysis.

Rules and evaluation

Evaluation

Evaluation will be done using the Kaggle platform. During the evaluation period, participants download an unlabeled copy of the test set, use their systems to predict labels, and upload those labels. Standard Kaggle rules apply.
Participants may submit to either or both of the two evaluations (in-domain or cross-domain).
Multiple submissions from the same team are allowed, up to a limit of two per day during the two-week evaluation period. Individual participants (i.e., PIs) may join multiple teams within reason, but only when each team reflects a fully independent engineering effort, and each team has a different lead developer. (Note: Kaggle may not allow entrants to formally join multiple teams. In this case, PIs and others spanning multiple teams should join at most a single Kaggle team and only disclose their memberships in their paper submission.)
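The predict-and-upload step described above might be sketched as follows. The CSV layout and the column names (pairID, gold_label) are assumptions for illustration only; the sample submission file on the Kaggle competition pages is the authority on the expected format.

```python
import csv
import io

# The three labels used in the task.
LABELS = ("entailment", "neutral", "contradiction")


def write_predictions(predictions, out, id_column="pairID", label_column="gold_label"):
    """Write (pair_id, label) rows as CSV with a header row.

    ASSUMPTION: the column names are hypothetical defaults, not confirmed
    by the task description.
    """
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow([id_column, label_column])
    for pair_id, label in predictions:
        if label not in LABELS:
            raise ValueError(f"unknown label: {label!r}")
        writer.writerow([pair_id, label])


# Example with made-up pair IDs, written to an in-memory buffer.
buffer = io.StringIO()
write_predictions([("123", "neutral"), ("124", "entailment")], buffer)
submission_text = buffer.getvalue()
```

Validating labels before writing catches misspelled class names locally, before one of the two daily submissions is spent.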

Systems

This competition is meant to evaluate the quality of vector representations of sentences, and all submitted systems should include a bottleneck in which sentences are represented as fixed-length vectors with no explicitly-imposed internal structure. Typical attention and memory models that represent sentences as sets or sequences of vectors, though useful for tasks like NLI, are not eligible for inclusion in this competition. (It is allowed, should it be useful, to use two separate models to encode the two input sentences.)
The development sets are to be used for model selection and the tuning of reasonable hyperparameters. Models that are explicitly trained on the development data may be disqualified.
Models should make their predictions for each example independently. It is the case that different pairs sharing the same premise typically have different labels (as an artifact of how we created the data), but systems are not allowed to exploit this signal to model joint distributions over multiple examples at once. (If you aren’t sure whether this applies to your system, it probably doesn’t.)
For inclusion in the workshop and the final leaderboard, participants will have to upload a code package that can be used to reproduce the submitted results. This code package will not be used as the primary means of evaluation, but it will be made public to encourage both reproducibility and future extension of submitted models.
For inclusion in the workshop and the final leaderboard, participants will also be asked to provide the sentence vectors produced by their best performing model. When you submit your system, you will also be asked to upload a link to a sentence vector file with lines of the form ‘pairID \t p or h (for premise or hypothesis) \t sentence representation vector’, where the vector is given as whitespace-delimited (tab or space) values. For example, a line might look like:

   123	p	0.27204 -0.06203 -0.1884 0.023225 -0.018158 0.0067192 -0.13877 0.17708 0.17709 ...

You should supply vectors for every sentence (premise and hypothesis) in the test set(s) for the competition(s) you’re submitting to. In addition, you are also asked to submit vectors for a set of additional probe sentences. These sentences are here, distributed in the same format as MultiNLI, but with one sentence per line, marked as the premise, and no hypothesis. All your vectors should be concatenated into a single file.
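The line format described above can be produced and read back with a few lines of code. This is a minimal sketch: the function names are hypothetical, and it chooses single spaces between vector values (the description allows tabs or spaces).

```python
def format_vector_line(pair_id, role, vector):
    """Build one line: pairID, tab, 'p' or 'h', tab, space-separated values."""
    if role not in ("p", "h"):
        raise ValueError("role must be 'p' (premise) or 'h' (hypothesis)")
    values = " ".join(repr(float(x)) for x in vector)
    return f"{pair_id}\t{role}\t{values}"


def parse_vector_line(line):
    """Split a line back into (pair_id, role, list of floats).

    Only the first two tabs are treated as field separators, so vector
    values delimited by either tabs or spaces are both accepted.
    """
    pair_id, role, values = line.rstrip("\n").split("\t", 2)
    return pair_id, role, [float(x) for x in values.split()]


line = format_vector_line("123", "p", [0.27204, -0.06203, -0.1884])
parsed = parse_vector_line(line)
```

Reading a file back with a parser like this before uploading is a cheap way to confirm that every premise and hypothesis has a vector of the same length.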

Outside data

The use of outside data is allowed, including raw unannotated text from any source (including OpenANC and the other sources from which MultiNLI was derived), word vector packages, and knowledge resources like WordNet.
All outside data used must be publicly available to allow for reproducibility. Widely-used data with restrictive licenses or licensing fees (such as LDC-distributed corpora) may be allowed at our discretion. Please inquire at the QA forum below.

Baselines

The corpus description paper presents the following baselines:

Model	Matched Test Acc.	Mismatched Test Acc.
Most Frequent Class	36.5	35.6
CBOW	65.2	64.6
BiLSTM	67.5	67.1

Note that the paper also presents results with an ESIM. That model relies on attention between sentences and would be ineligible for inclusion in this competition.
Both the CBOW and BiLSTM models are trained on a mix of MultiNLI and SNLI and use GloVe word vectors.
Code (TensorFlow/Python) is available here.
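To make the fixed-length bottleneck concrete, here is a toy sketch of a CBOW-style encoder: each sentence is encoded on its own by averaging word vectors, and the classifier sees only features computed from the two resulting vectors. This is not the released baseline code. The tiny embedding table stands in for GloVe, and the feature combination (concatenation, absolute difference, elementwise product) is an assumption about a reasonable design rather than a description of the baselines.

```python
def encode_cbow(tokens, embeddings, dim):
    """Average the vectors of known tokens into one fixed-length vector.

    Unknown tokens are skipped; a sentence with no known tokens maps to zeros.
    """
    total = [0.0] * dim
    known = 0
    for token in tokens:
        vector = embeddings.get(token)
        if vector is None:
            continue
        known += 1
        for i, value in enumerate(vector):
            total[i] += value
    if known == 0:
        return total
    return [t / known for t in total]


def pair_features(u, v):
    """Classifier input built only from the two sentence vectors."""
    diff = [abs(a - b) for a, b in zip(u, v)]
    prod = [a * b for a, b in zip(u, v)]
    return u + v + diff + prod


# Toy 2-dimensional embeddings (hypothetical values, not GloVe).
toy_embeddings = {"people": [1.0, 0.0], "line": [0.0, 1.0], "up": [1.0, 1.0]}
premise = encode_cbow("people line up".split(), toy_embeddings, dim=2)
hypothesis = encode_cbow("people formed a line".split(), toy_embeddings, dim=2)
features = pair_features(premise, hypothesis)
```

The point of the sketch is the interface: no information passes between the two sentences until each has been reduced to a single vector, which is what separates eligible systems from attention-based models such as ESIM.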

Leaderboard and evaluation site

To participate in the competition, evaluate your system, or view the current mid-competition leaderboard (including systems that may not qualify for the final leaderboard), use the two Kaggle in Class competitions (matched and mismatched).

Paper submission

For inclusion in the workshop and the final leaderboard, you must submit:

A system description paper of 2–4 pages in EMNLP format. System description papers will be reviewed for readability and soundness (but not novelty/technical merit) before acceptance.
A .zip code package (as a link from your paper) that can be used to reproduce the submitted results after being trained on widely-available data files.
A URL for a vector package, as discussed above.

Paper preparation and uploading instructions can be found in the Call for Papers.

Key dates

March 24: Training and development data and draft data description paper available, competition begins
By May 15: Expert-tagged development data for error analysis available
June 1: Unlabeled test data available, evaluation period begins, Kaggle evaluation site opens
June 14 (GMT-11, 23:59:59): Evaluation period ends, system description papers and code packages due
June 16: Results announced (see above)
July 3 (GMT-11, 23:59:59): Reviews due
July 6: Notification of presentation acceptance
July 21 (GMT-11, 23:59:59): Camera ready papers due
September 8: Workshop at EMNLP 2017, Copenhagen: shared task poster session and selected short talks

Join us!

Announcements list (all participants should subscribe): Google Group
Discussion/FAQ forum (ask questions here first): Google Forum
Private questions: Sam Bowman