The RepEval 2017 Shared Task
RepEval 2017 features a shared task meant to evaluate natural language understanding models based on sentence encoders—that is, models that transform sentences into fixed-length vector representations and reason using those representations. The task will be natural language inference (also known as recognizing textual entailment, or RTE) in the style of SNLI—a three-class balanced classification problem over sentence pairs. The shared task will feature a new, dedicated dataset that spans several genres of text. Participation is open and new teams may join at any time.
The shared task will feature two evaluations: a standard in-domain evaluation, in which the training and test data are drawn from the same sources, and a cross-domain evaluation, in which the training and test data differ substantially. The cross-domain evaluation will test the ability of submitted systems to learn representations of sentence meaning that capture broadly useful features.
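As a rough illustration of the two evaluations, the sketch below scores one set of predictions separately for examples whose genre appears in the training data and for examples whose genre does not. It is not the official scoring code: the function name, the dictionary keys `genre` and `gold_label`, and the toy genre tags are all assumptions made for the example.

```python
from collections import defaultdict


def accuracy_by_condition(examples, predictions, training_genres):
    """Return accuracy separately for in-domain and cross-domain examples.

    examples: list of dicts with "genre" and "gold_label" keys (assumed names).
    predictions: list of predicted label strings, aligned with examples.
    training_genres: set of genre tags that occur in the training data.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for example, predicted in zip(examples, predictions):
        in_domain = example["genre"] in training_genres
        condition = "in_domain" if in_domain else "cross_domain"
        total[condition] += 1
        correct[condition] += int(predicted == example["gold_label"])
    return {condition: correct[condition] / total[condition] for condition in total}


# Toy data: "genre_seen" stands for a genre present in training,
# "genre_unseen" for a genre that only appears at test time.
examples = [
    {"genre": "genre_seen", "gold_label": "neutral"},
    {"genre": "genre_seen", "gold_label": "entailment"},
    {"genre": "genre_unseen", "gold_label": "contradiction"},
    {"genre": "genre_unseen", "gold_label": "neutral"},
]
predictions = ["neutral", "neutral", "contradiction", "neutral"]

scores = accuracy_by_condition(examples, predictions, training_genres={"genre_seen"})
print(scores)
```

A large gap between the two numbers would suggest that a system relies on genre-specific features rather than on broadly useful ones.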
A preliminary version of the training and development sections of the task data can be found here.
The task dataset (called the Multi-Genre NLI Corpus, or MultiNLI) consists of 393k training examples drawn from five genres of text, and 40k test and development examples drawn from those same five genres, as well as five more. Data collection for the task dataset is closely modeled on SNLI. SNLI, which is based on the genre of image captions, may be used as additional training and development data, but will not be included in the evaluation.
|Section name||Training pairs||Dev pairs||Test pairs|
|Captions (SNLI Corpus)||(550,152)||(10,000)||(10,000)|
|Nonfiction Books (OUP)||0||2,000||2,000|
As in SNLI, each example will consist of two sentences and a label. The first sentence is drawn from a preexisting text source—either one of the sections of the Open American National Corpus (OpenANC) or some other permissively licensed source. The second sentence is written by crowd workers as part of data collection. Data for each genre will be collected in a separate crowdsourcing task. The labels will be entailment, neutral, and contradiction, in roughly equal proportions. Some examples from the corpus can be seen below.
|Premise||Label||Hypothesis|
|The Old One always comforted Ca'daan, except today.||neutral||Ca'daan knew the Old One very well.|
|Your gift is appreciated by each and every student who will benefit from your generosity.||neutral||Hundreds of students will benefit from your generosity.|
|yes now you know if if everybody like in August when everybody's on vacation or something we can dress a little more casual or||contradiction||August is a black out month for vacations in the company.|
|At the other end of Pennsylvania Avenue, people began to line up for a White House tour.||entailment||People formed a line at the end of Pennsylvania Avenue.|
Data is available in the same two formats as SNLI: tab-separated values and jsonl. It will take the form of five files in each format: train, in-domain development, cross-domain development, in-domain test, and cross-domain test. Each individual example will be marked with a genre tag.
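The jsonl format stores one JSON object per line, so the files can be read with the standard library alone. The sketch below parses a small in-memory stand-in for a data file and counts examples per genre tag and per label. The field names `sentence1`, `sentence2`, and `gold_label` follow the SNLI release; the `genre` key and the placeholder genre values are assumptions and should be checked against the distributed files.

```python
import io
import json
from collections import Counter


def read_jsonl(handle):
    """Yield one example dict per non-empty line of a jsonl stream."""
    for line in handle:
        line = line.strip()
        if line:
            yield json.loads(line)


# Two records built from the sample pairs shown above. The genre tags are
# placeholders, not the real tags used in the corpus.
records = [
    {
        "sentence1": "The Old One always comforted Ca'daan, except today.",
        "sentence2": "Ca'daan knew the Old One very well.",
        "gold_label": "neutral",
        "genre": "genre_a",
    },
    {
        "sentence1": "At the other end of Pennsylvania Avenue, people began to line up for a White House tour.",
        "sentence2": "People formed a line at the end of Pennsylvania Avenue.",
        "gold_label": "entailment",
        "genre": "genre_b",
    },
]
stream = io.StringIO("\n".join(json.dumps(record) for record in records) + "\n")

examples = list(read_jsonl(stream))
genre_counts = Counter(example["genre"] for example in examples)
label_counts = Counter(example["gold_label"] for example in examples)
print(genre_counts, label_counts)
```

With a real file, `stream` would be replaced by something like `open(path, encoding="utf-8")`, where `path` points at one of the five jsonl files.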
We are also separately distributing a small subset of the development data that has been manually annotated to facilitate error analysis.
Rules and evaluation
- Evaluation will be done using the Kaggle platform. During the evaluation period, submitters download an unlabeled copy of the test set, use their systems to predict labels, and upload those labels. Standard Kaggle rules apply.
- Submitters may submit to either or both of the two evaluations (in-domain or cross-domain).
- Multiple submissions from the same team are allowed, up to a limit of two per day during the two-week evaluation period. Individual participants (i.e., PIs) may join multiple teams within reason, but only when each team reflects a fully independent engineering effort and each team has a different lead developer. (Note: Kaggle may not allow entrants to formally join multiple teams. In this case, PIs and others spanning multiple teams should join at most one Kaggle team and disclose their other team memberships in their paper submission.)
- This competition is meant to evaluate the quality of vector representations of sentences, and all submitted systems should include a bottleneck in which sentences are represented as fixed-length vectors with no explicitly-imposed internal structure. Typical attention and memory models that represent sentences as sets or sequences of vectors, though useful for tasks like NLI, are not eligible for inclusion in this competition. (It is allowed, should it be useful, to use two separate models to encode the two input sentences.)
- The development sets are to be used for model selection and the tuning of reasonable hyperparameters. Models that are explicitly trained on the development data may be disqualified.
- Models should make their predictions for each example independently. Different pairs that share the same premise typically have different labels (an artifact of how we created the data), but systems are not allowed to exploit this signal to model joint distributions over multiple examples at once. (If you aren’t sure whether this applies to your system, it probably doesn’t.)
- For inclusion in the workshop and the final leaderboard, submitters will have to upload a code package that can be used to reproduce the submitted results. This code package will not be used as the primary means of evaluation, but it will be made public to encourage both reproducibility and future extension of submitted models. Moreover, for a limited set of sentences, participants will be asked to provide the sentence vectors produced by their best-performing model. These vectors will be used in our own analysis of the results.
- The use of outside data is allowed: raw unannotated text from any source, word vector packages, and knowledge resources like WordNet are explicitly permitted. We will provide links to unlabeled OpenANC data that reflects the target genres.
- All outside data used must be publicly available to allow for reproducibility. Widely-used data with restrictive licenses or licensing fees (such as LDC-distributed corpora) may be allowed at our discretion. Please inquire at the QA forum below.
- The corpus description paper presents the following baselines:
|Model||Matched Test Acc.||Mismatched Test Acc.|
|Most Frequent Class||36.5||35.6|
Note that the paper also presents results with an ESIM. That model relies on attention between sentences and would be ineligible for inclusion in this competition.
Both models are trained on a mix of MultiNLI and SNLI and use GloVe word vectors.
Code (TensorFlow/Python) is available here.
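The fixed-length bottleneck required by the rules above can be illustrated with a deliberately tiny, dependency-free sketch. Each sentence is encoded on its own into one vector of a fixed size, and the classifier only ever sees those two vectors. This is not the baseline code referred to above: the hash-based word vectors are a stand-in for pretrained vectors such as GloVe, the averaging encoder and the concatenation, difference, and product features are one common choice among many, and the weights are untrained placeholders. All names are hypothetical.

```python
import hashlib

DIM = 8  # toy sentence-vector length; must not exceed 32 with this hashing scheme
LABELS = ["entailment", "neutral", "contradiction"]


def word_vector(word, dim=DIM):
    """Deterministic pseudo-embedding from a hash (stand-in for real word vectors)."""
    digest = hashlib.sha256(word.lower().encode("utf-8")).digest()
    return [(byte / 255.0) * 2.0 - 1.0 for byte in digest[:dim]]


def encode(sentence, dim=DIM):
    """Map one sentence to one fixed-length vector by averaging its word vectors."""
    words = sentence.split()
    if not words:
        return [0.0] * dim
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(word_vector(word, dim)):
            total[i] += value
    return [value / len(words) for value in total]


def pair_features(premise_vec, hypothesis_vec):
    """Combine the two sentence vectors; no word-level information crosses over."""
    diff = [p - h for p, h in zip(premise_vec, hypothesis_vec)]
    prod = [p * h for p, h in zip(premise_vec, hypothesis_vec)]
    return premise_vec + hypothesis_vec + diff + prod


def classify(features, weights, biases):
    """Linear classifier over the pair features; returns the top-scoring label."""
    scores = [
        sum(w * f for w, f in zip(row, features)) + bias
        for row, bias in zip(weights, biases)
    ]
    return LABELS[scores.index(max(scores))]


premise_vec = encode("People began to line up for a White House tour.")
hypothesis_vec = encode("People formed a line.")
features = pair_features(premise_vec, hypothesis_vec)

# Untrained placeholder parameters: one weight row and one bias per label.
weights = [[0.0] * len(features) for _ in LABELS]
biases = [0.0] * len(LABELS)
prediction = classify(features, weights, biases)
print(len(premise_vec), len(hypothesis_vec), prediction)
```

The point of the sketch is the data flow: the two calls to `encode` never see each other's input, and every sentence yields a vector of the same length regardless of how many words it contains. A model with attention between the two sentences would break this separation.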
- For inclusion in the workshop and the final leaderboard, you must submit:
- A system description paper of 2–4 pages in ACL format. System description papers will be reviewed for readability and soundness (but not novelty/technical merit) before acceptance.
- A .zip code package that can be used to reproduce the submitted results after being trained on widely-available data files.
- March 24: Training and development data and draft data description paper available, competition begins
- By May 15: Expert-tagged development data for error analysis available
- June 1: Unlabeled test data available, evaluation period begins, Kaggle evaluation site opens
- June 14 (GMT-11, 23:59:59): Evaluation period ends, system description papers and code packages due
- June 16: Winners formally announced
- July 3 (GMT-11, 23:59:59): Reviews due
- July 6: Notification of presentation acceptance
- July 21 (GMT-11, 23:59:59): Camera-ready papers due
- September 8: Workshop at EMNLP 2017, Copenhagen: shared task poster session and selected short talks