After our meeting last week, we decided to drop the error-type annotation step from the ESL HIT and instead give Turkers a simple, well-defined task: fix the errors. We will try using a second HIT to ask different Turkers to label the types of corrections, for the sake of building a corpus. I am very excited about this approach. I think that agreement between Turkers will be higher, our data will be cleaner, and the task design will be easier to explain and motivate. Even better, it will allow us to work on the interesting project of trying to predict annotations from sentence context and part-of-speech features. All in all, I'm optimistic about the new direction.
I've written a lot of Python code this week (...woohoo, a break from JavaScript...) and have a nice set of scripts for pulling data off of MTurk, parsing it, and dumping it into a CSV. More interestingly, I now have a set of scripts that walks through the edit data and builds a graph. The graph can be used to trace the changes that were made, recover the original sentence from the final form, and associate a span in the original sentence with a span in the final sentence (see the picture below). I think the graph structure is clean and makes it fairly easy to find associations between changes that can morph quite a bit before they reach their final form. The only awkward part I've run into so far is dealing with insertions. I am trying a version in which extra blank nodes sit between all the words in a sentence. This way, an insertion is treated somewhat like a substitution, in which a word (and another blank) is substituted for an existing blank. This allows spaces to spawn new words without having to allow random orphan nodes in the graph. (It also parallels the GUI we have for the HIT nicely. No benefit to that except cleanliness in my mind, but that's worth something.)
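
The blank-node idea can be sketched roughly as follows. This is an illustrative toy under my own assumptions, not the actual scripts: the names `Node`, `make_sentence`, `substitute`, `insert`, and `recover_original` are hypothetical, and the real edit data from MTurk almost certainly needs more bookkeeping (deletions, span offsets, and so on). Every new node simply remembers the nodes it replaced, so an insertion becomes a substitution on a blank and nothing in the graph is ever orphaned.

```python
from dataclasses import dataclass, field
from typing import List

BLANK = ""  # text of a blank node, i.e. the gap between two words


@dataclass(eq=False)  # eq=False: nodes are compared by identity, not by text
class Node:
    text: str
    parents: List["Node"] = field(default_factory=list)

    def roots(self) -> List["Node"]:
        """Return the original-sentence nodes this node descends from."""
        if not self.parents:
            return [self]
        found: List["Node"] = []
        for parent in self.parents:
            for root in parent.roots():
                if root not in found:
                    found.append(root)
        return found


def make_sentence(words: List[str]) -> List[Node]:
    """Interleave blanks with words: _ w1 _ w2 _ ... wn _"""
    nodes = [Node(BLANK)]
    for word in words:
        nodes.append(Node(word))
        nodes.append(Node(BLANK))
    return nodes


def substitute(nodes: List[Node], start: int, end: int, new_texts: List[str]) -> List[Node]:
    """Replace nodes[start:end] with new nodes that point back at the replaced span."""
    replaced = nodes[start:end]
    new_nodes = [Node(text, parents=list(replaced)) for text in new_texts]
    return nodes[:start] + new_nodes + nodes[end:]


def insert(nodes: List[Node], blank_index: int, word: str) -> List[Node]:
    """An insertion is a substitution on a blank: _ becomes _ word _"""
    if nodes[blank_index].text != BLANK:
        raise ValueError("insertions must target a blank node")
    return substitute(nodes, blank_index, blank_index + 1, [BLANK, word, BLANK])


def surface(nodes: List[Node]) -> str:
    """Render a node sequence as a plain sentence, skipping blanks."""
    return " ".join(n.text for n in nodes if n.text != BLANK)


def recover_original(final_nodes: List[Node]) -> List[Node]:
    """Walk parent links from the final form back to the original nodes."""
    original: List[Node] = []
    for node in final_nodes:
        for root in node.roots():
            if root not in original:
                original.append(root)
    return original


# Made-up example sentence, not taken from the HIT data.
original = make_sentence(["I", "goed", "store"])  # _ I _ goed _ store _
step1 = substitute(original, 3, 4, ["went"])      # goed -> went
step2 = insert(step1, 4, "to")                    # blank after "went" spawns "to"
step3 = insert(step2, 6, "the")                   # blank after "to" spawns "the"

print(surface(step3))
print(surface(recover_original(step3)))
```

In this sketch the inserted word "the" traces back through two generations of blanks to the single original blank between "goed" and "store", which is the kind of span-to-span association the graph is meant to expose.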

The other major change to our HIT was in data cleaning. Right now, I am trying a version that takes the single top-rated sentence from our Choose-Best-Translation HIT for each input sentence and asks Turkers to edit that translation. My only concern about the current version (the best translations) is that it over-represents minor errors (like articles, prepositions, and punctuation) and under-represents the less-studied errors that are more interesting for data collection, like verb tense, mass/count, and word form. I am also going to try a version of the HIT that gives multiple versions of the same sentence for editing. My concern with this version is that it will be too repetitive, or that it will lead Turkers to make edits that are not justified by the original sentence but are instead borrowed from the other versions of the same sentence. These are all just speculations from the small amount of data I have looked at so far, and they will be easier to test once we are collecting more data.