Build It Break It is a new type of shared task for AI problems that pits AI system "builders" against human "breakers" in an attempt to learn more about the generalizability of current technology.
This page describes the motivation and structure of the shared task.
Sign up to be a Builder or a Breaker here and get access to all the data and submission instructions!
Want to just download the data? Find the links below:
@Builders: Download the Training and Blind Dev Data!
@Breakers: Download the Blind Test Data!
It is widely acknowledged that most NLP systems are quite brittle. At the "Workshop on Representation Learning for NLP" at ACL 2016, Chris Dyer (moderator) asserted that for any linguistically naive model, he could construct an example on which it fails. Most panelists agreed. The fact that they were so flippant about this suggests that there is a lot left to be desired in the traditional model of learning widely adopted from the machine learning community. That model assumes that the data distribution at test time matches that at training time, and that all errors have the same cost.
In software development, we know that pretty much all software has bugs, and nefarious agents may choose to exploit those bugs, sometimes producing surprising, dangerous, or advantageous behavior. The Build it Break it Fix it Contest (BIBIFI) was developed with this in mind, and our shared task brings similar ideas to the natural language processing and linguistics communities, but where "breaking" for us means "causing systems to make incorrect predictions". The goals are several:
- we want to build more reliable NLP technology by stress-testing systems against an adversary;
- we want to learn more about what linguistic phenomena our systems are capable of handling, so that we can direct research in interesting directions;
- we want to encourage researchers to think about what assumptions their models are implicitly making, by asking them to break them;
- we want to build a test collection of examples that are not necessarily high probability under the distribution of the training data, but are nonetheless representative of language phenomena that we expect a reasonable NLP system to handle;
- and finally, we want to increase cross-talk between linguistics and natural language processing researchers.
Our shared task will run in three rounds:
- Building Round: We will release training data (which you can choose to use or ignore as you like) and Builder Teams will build systems for solving the task. Five days before the end of the building round, we will release blind development data on which Builders must make predictions and upload them to our server.
- Breaking Round: Breakers will be given access to the predictions of the Builder systems on the development data, as well as a reasonably large collection of blind test data. From this test data, they must construct minimal pairs that they think will fool the Builders' systems. One sentence in the minimal pair must be a sentence that exists in the blind test data, and the other must be a very small edit of that sentence. Breakers will provide ground truth labels for both sentences in the minimal pair. They must upload these data to our server.
- Judgment Round: All minimal pair sentences will be collected, shuffled, and sent back to the Builder Teams. They must run their systems as-is and upload predictions for all of the Breaker data.
Note: We use the term minimal pair more broadly than it is standardly used in linguistics. For any given pair, the difference between the two sentences should ideally be targeted enough that we can identify a particular linguistic phenomenon the system apparently can or cannot handle, based on its predictions for one member of the pair versus the other.
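Concretely, each minimal pair a Breaker submits consists of a sentence taken from the blind test data, a small edit of that sentence, and ground-truth labels for both. The snippet below is a purely hypothetical sketch (in Python) of that information; the actual upload format is given in the submission instructions, not here.
# Hypothetical sketch of the information carried by one submitted minimal pair.
# This is NOT the official upload format; see the submission instructions.
from typing import NamedTuple

class MinimalPair(NamedTuple):
    original_sentence: str   # a sentence taken verbatim from the blind test data
    edited_sentence: str     # the Breaker's very small edit of that sentence
    original_label: str      # Breaker-provided ground truth for the original
    edited_label: str        # Breaker-provided ground truth for the edit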
We will run two tasks in parallel:
- Sentence-level sentiment analysis: This data is derived from the Pang+Lee+Vaithyanathan data from Rotten Tomatoes, with further annotation by Socher+al. and with additional newly crowdsourced test data. The task is a simple binary classification: given a sentence, predict whether it's positive or negative toward the movie.
- Semantic role labeling as question answering: This task and data are derived from He+Lewis+Zettlemoyer's work on Question-Answer Driven Semantic Role Labeling. The input is a sentence and a question related to one of the predicates in the sentence, and the output is a span of the sentence that answers the question.
For the purpose of these instructions, we will consider the concrete task of sentiment analysis because it is easier to discuss. The task here is to label the sentiment of a sentence as negative (-1) or positive (+1), as in:
+1 The acting is superb.
-1 This movie tried to be entertaining but failed.
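To make this concrete from the Builder side, here is a minimal baseline sketch. It assumes, purely for illustration, a training file named train.txt whose lines follow the label-then-sentence layout shown above; the actual data format may differ, and a bag-of-words scikit-learn model is just one of many reasonable starting points.
# Minimal Builder baseline sketch: bag-of-words logistic regression.
# Assumes a file "train.txt" with one "label<space>sentence" example per line,
# mirroring the layout shown above; the real data format may differ.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labels, sentences = [], []
with open("train.txt", encoding="utf-8") as f:
    for line in f:
        if not line.strip():
            continue
        label, sentence = line.strip().split(maxsplit=1)
        labels.append(int(label))        # -1 or +1
        sentences.append(sentence)

model = make_pipeline(CountVectorizer(), LogisticRegression())
model.fit(sentences, labels)
print(model.predict(["The acting is superb."]))   # hopefully [1]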
Your main job is to construct minimal pairs that you think will fool an NLP system. In order to limit variability, we will give you a list of sentences from which you must choose one of the two sentences in each minimal pair. For instance:
? Every actor in this movie is horrible.
? I love this movie!
You can create a minimal pair from any of these sentences by writing your own paired sentence, but you must make it as close to the given sentence as possible. For instance, you might choose to riff on the first example above and make a minimal change to the wording that changes the sentiment, generating the following pair:
-1 Every actor in this movie is horrible.
+1 Every actor in this movie is wonderful.
This might not be a great pair because even a stupid sentiment analysis system would probably get this right.
You could also riff on the second example above, and not change the sentiment, generating the following pair:
+1 I love this movie!
+1 I'm mad for this movie!
This is more interesting because a simple sentiment system might see the word "mad" and think that this makes the review negative.
Your goal is to construct pairs on which the systems make an error on one, but not both, of the sentences in the minimal pair. (Either one is fine.)
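As a sketch of this criterion, consider a deliberately naive word-list sentiment system; the word lists below are invented for illustration and do not describe any actual Builder system. A pair "breaks" such a system when the system is correct on exactly one of the two sentences.
# Sketch: a deliberately naive lexicon-based system and the "break" criterion.
# The word lists are invented purely for illustration.
POSITIVE = {"love", "superb", "wonderful"}
NEGATIVE = {"horrible", "mad", "failed"}

def naive_predict(sentence):
    words = sentence.lower().strip("!.?").split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return +1 if score >= 0 else -1

def breaks_system(pair_with_gold, predict):
    # A pair breaks the system if the system is wrong on exactly one sentence.
    (sent_a, gold_a), (sent_b, gold_b) = pair_with_gold
    correct_a = predict(sent_a) == gold_a
    correct_b = predict(sent_b) == gold_b
    return correct_a != correct_b

pair = [("I love this movie!", +1), ("I'm mad for this movie!", +1)]
print(breaks_system(pair, naive_predict))   # True: "mad" fools the naive system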
If you are wondering whether a particular change is small enough to qualify the pair as a "minimal pair," a good question to ask is: "Do I have a sense of what, specifically, I could conclude about the system's linguistic capacity if, for instance, it were able to handle the first sentence but not the second?"
Instructions for QA-SRL:
In QA-SRL (more details in the link above), systems are given a sentence and a question related to one of the predicates in the sentence, and must output a span of the sentence that answers the question.
For example:
Sentence: UCD finished the 2006 championship as Dublin champions, by beating St Vincents in the final.
Predicate: beating
Question: Who beat someone?
Answer: UCD
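Since the output is a span of the sentence, one way to picture the example above is with character offsets, as in the sketch below. This representation is illustrative only and does not reflect any official data or submission format.
# Sketch: the QA-SRL example above, with the answer located as a character span.
# Illustrative only; not the official data format.
sentence = ("UCD finished the 2006 championship as Dublin champions, "
            "by beating St Vincents in the final.")
predicate = "beating"
question = "Who beat someone?"
answer = "UCD"

start = sentence.find(answer)           # 0
end = start + len(answer)               # 3
assert sentence[start:end] == answer    # the answer is a contiguous span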
For creating minimal pairs in this task, breakers may change only the original sentence (and, if appropriate, the answer); the question itself must remain unchanged. So, for instance, one could generate the following minimal pair for the example above:
Sentence': UCD finished the 2006 championship as Dublin champions, when they beat St Vincents in the final.
Answer': UCD (unchanged)
This could be an interesting change because we might expect the system to now output the pronoun "they" as the answer to the question, without resolving it to UCD.
Note that breakers cannot change the sentence in such a way that the accompanying question is no longer answerable with a substring of the edited sentence. For instance, breakers should not make a change such as "Terry fed Parker" --> "Parker was fed" with an accompanying test question of "Who fed Parker?", since the answer to that question is no longer contained in the sentence.
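As a rough sanity check for this constraint (purely an illustration, not an official validator), a Breaker could verify that the intended answer still appears verbatim in the edited sentence:
# Sketch: sanity-check a QA-SRL edit -- the intended answer must still appear
# verbatim as a substring of the edited sentence. Illustrative only.
def edit_is_answerable(edited_sentence, answer):
    return answer in edited_sentence

good_edit = ("UCD finished the 2006 championship as Dublin champions, "
             "when they beat St Vincents in the final.")
bad_edit = "Parker was fed."

print(edit_is_answerable(good_edit, "UCD"))     # True: a valid pair
print(edit_is_answerable(bad_edit, "Terry"))    # False: not allowed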