Summer 2012 Research Journal: July 2012

Tuesday, July 31, 2012

Preliminary Data Analysis (Part II: Some Results)

I started with two very rough views of the data:

First, I broke down all of the edits that were made by two categories: the number of redundant edits we had for the sentence, and the kind of edit that was made (change/insert/delete/move). This was mostly just to get an idea of where the big restrictions are in the data. Since most of the sentences have 3 redundant edits, I think I will focus mostly on the 3-times-edited subset of the data for most of my future analysis on this data set, for simplicity.

Second, I started to look at a very basic measure of agreement. The metric I am using right now is the fraction of the other annotators who make a given edit, given that at least one annotator makes that edit; the first annotator to make the edit is not counted, so that if only one annotator makes a change, the "agreement" is 0. For example, in a sentence with 4 annotators, if a word is deleted by 2 annotators, the agreement on the deletion of that word is (2 - 1) / (4 - 1) = 0.33. (Again, the - 1 is because the first annotator does not count as agreeing with himself.) The graph below is the average "agreement" over all the examples of the given edit mode, within each redundancy level. It is interesting to note that there is no agreement between annotators on reordering edits; this is very likely because there are so few examples over which to compare.

(*) All of the 1x redundant edits are 0 by definition. (**) The measure is very sensitive to the number of redundant translations available, so comparing y values between the four quadrants isn't very informative. While this won't be a problem once we have 4 redundancies for all sentences, I am still considering looking into metrics that account for the increased chance of randomly agreeing as more annotators are added.
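The agreement measure described above can be written down in a few lines. This is only a minimal sketch of the formula as stated in this post; the function name and argument names are mine, not taken from the actual analysis scripts.

```python
def agreement(num_making_edit, num_annotators):
    """Fraction of the *other* annotators who also made a given edit.

    The first annotator to make the edit is not counted, so a lone edit
    scores 0 and unanimous agreement scores 1.
    """
    if not 1 <= num_making_edit <= num_annotators:
        raise ValueError("need 1 <= num_making_edit <= num_annotators")
    if num_annotators == 1:
        # With a single annotator there is nobody to agree with: 0 by definition.
        return 0.0
    return (num_making_edit - 1) / (num_annotators - 1)


# The worked example from the text: 2 of 4 annotators delete the same word.
print(round(agreement(2, 4), 2))  # 0.33
```

Because the denominator changes with the number of annotators, the same raw count of matching edits gives different scores at different redundancy levels, which is the sensitivity noted in (**).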

The agreement numbers are very low on average. As a next step, I would like to look into the agreement broken down by POS, since manually flipping through edits suggests that agreement on punctuation and articles tends to be higher than average. The immediate issue with this is that the final sentences, which I would try to POS-tag, are not always well formed. I am hoping that most of the words are unambiguous enough not to overly confuse the tagger... to be continued...

Preliminary Data Analysis (Part I: Some Data Imperfections)

After getting the data off of MTurk and tweaking my data structure a bit, I am finally getting to pick through the data. Here are some of the problems I am having:

The way I am populating the sentences for the HITs isn't correct. I posted 50 HITs which were each supposed to be done by 4 turkers, but not all of the sentences have 4 redundant corrections. Instead, sentences have either 1, 2, 3, or 4 edits, which makes it look like the sentences were somewhat randomly populated each time a turker accepted the HIT. I will have to look into this before we post more HITs, since it is not ideal for making comparisons across the data. For now, I am just dividing the sentences into 4 sets, based on how many redundant corrections they have, and looking at each set separately.
Some of the edits actually look worse after editing than before (see below). Many more are fixed in part but not in full. This set of HITs did not use any sort of control, which is a work in progress for the next batch, so that may help with these problems. I also am not sure that we used any filtering this time around to ensure that workers were based in the US, which may also help.
There are a fair number of sentences which are not edited at all. It is possible that something as simple as having a button that says "No edits needed" and enforcing that either an edit is made or that button is clicked before moving to the next sentence might help prevent sentences from being skipped without editing. Even more of a deterrent for skipping sentences would be requiring some sort of explanation if no edits are made. This seems annoying to me, but could help. Hopefully the controls will be sufficient.

Some of the edits are just confusing. Luckily, this one is something of an outlier.

Thursday, July 26, 2012

HIT Posted!

I posted the HIT on MTurk this morning, and after about an hour, 20 of the HITs have already been completed! I'm thrilled - I have to say, I was not expecting such instant results.

Below are a few images of redundant revisions of sentences, for comparison. As we predicted, a lot of the editors make minimal corrections, and the final version of the sentence is better, but not perfect.

Just flipping through the corrections suggests that the overlap between editors on how to correct things will be very low compared to the overlap on what to correct. The sentences below are a good example of this. Matt and I had discussed using parse trees to tease out error annotations as part of automatic correcting. I think the example below is the kind where this could be useful, since we would need to capture the fact that the problem is not with either word in isolation, but rather with the fact that small/long should not be used for comparison.

Tuesday, July 24, 2012

Edit Data Structure Visualization

After Chris and Matt's positive feedback on the visualization of the data structure I described yesterday (and after realizing it is much easier to debug the actual data structure when it is a picture), I wrote another script using one of Python's graphics modules to automatically generate the figure. So now we can throw any set of edits at the script and crank out these figures. I have to say this was one of my most productive train rides... usually I listen to Grizzly Bear and fall asleep after 5 minutes...

Monday, July 23, 2012

After our meeting last week, we decided to drop the error-type annotation step from the ESL HIT and instead to present Turkers with a simple, well-defined task: fixing errors. We will try using a second HIT to ask different Turkers to label/annotate the types of corrections for the sake of building a corpus. I am very excited about this approach. I think that the agreement between Turkers will be higher, our data will be cleaner, and task design will be easier to explain/motivate. Even better, it will allow us to work on the interesting project of trying to predict annotations from sentence context and part of speech features. All in all, I'm optimistic about the new direction.

I've written a lot of Python code this week (...woohoo break from JavaScript...) and have a nice set of scripts for taking data off of MTurk, parsing it, and dumping it into a CSV. More interestingly, I now have a set of scripts that traces through the edit data and builds a graph that can be used to trace through the changes that were made, recover the original sentence from the final form, and associate a span in the original sentence with a span in the final sentence (see the picture below). I think the graph structure is clean and makes it fairly easy to find associations between changes that can morph quite a bit before they reach their final form. The only awkward part I've run into so far is dealing with insertions. I am trying a version in which I have extra nodes as blanks between all the words in a sentence. This way, inserts are treated somewhat like substitutions, in which a word (and another blank) is substituted for an existing blank. This allows spaces to spawn new words without having to allow random orphan nodes in the graph. (It also parallels nicely to the GUI we have for the HIT. No benefit to that except cleanliness in my mind, but that's worth something.)
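To make the blank-node idea concrete, here is a heavily simplified sketch that uses a flat list of slots rather than the full edit graph. The names (`BLANK`, `with_blanks`, `insert_word`) are hypothetical, and the exact layout is one reading of the description above: the targeted blank is replaced by blank, word, blank, so every word stays flanked by blanks and later insertions always have a slot to attach to.

```python
BLANK = None  # hypothetical marker for the empty slot between words


def with_blanks(words):
    """Interleave blank slots around every word: [_, w1, _, w2, _]."""
    slots = [BLANK]
    for word in words:
        slots.extend([word, BLANK])
    return slots


def insert_word(slots, blank_index, word):
    """Treat an insertion as a substitution on an existing blank slot."""
    if slots[blank_index] is not BLANK:
        raise ValueError("insertions must target a blank slot")
    # The blank is replaced by (blank, word, blank), so no orphan nodes appear.
    return slots[:blank_index] + [BLANK, word, BLANK] + slots[blank_index + 1:]


def surface(slots):
    """Recover the plain word sequence by dropping the blanks."""
    return [slot for slot in slots if slot is not BLANK]


slots = with_blanks(["I", "like", "cats"])
edited = insert_word(slots, 4, "black")  # index 4 is the blank before "cats"
print(surface(edited))  # ['I', 'like', 'black', 'cats']
```

The real structure also has to record which original node each new node came from; this sketch only shows why insertions stop being a special case once the blanks exist.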

The other major change to our HIT was in data cleaning. Right now, I am trying a version that uses the one top-rated sentence from our Choose-Best-Translation HIT for each input sentence, and asks Turkers to edit that translation. The only concern I have about the current version (the best translations) is that they over-represent minor errors (like articles, prepositions, and punctuation) and under-represent the less-studied errors (but ones that are interesting for collecting data) like verb tense, mass count, and word form. I am also going to try a version of the HIT that gives multiple versions of the same sentence for editing. My concern with this version is that it will be too repetitive, or will cause Turkers to make edits that are not necessarily justified given the original sentence but rather are taken from the other versions of the same sentence. These are all just speculations from the little bits of data at which I am looking, and will be easier to test once we are collecting more data.

Friday, July 13, 2012

July 13th - MTurk and JSON

I have not been able to reproduce my URI-too-long error that I mentioned in the last post. I have submitted obscene numbers of edits on the HITs, and they have submitted with no complaints. This is somewhat unsettling - persistent and predictable bugs are far better than sporadic, get-you-when-you-least-expect-it guerrilla-warfare-style bugs.

I have been creating data tables and SQL functions, and have been able to retrieve data from the HITs I have submitted. The problem is that all the data I retrieve is blank. My hidden JavaScript forms do not seem to be functioning anymore. I cannot figure out a good reason for this: when I run it on my machine, and even when I run it as a HIT that is not an external question, my $(id).val(turkerResponse) calls are able to populate the tables as expected and I am able to download/view the results. For some reason, these value-setters are not setting the values of my forms when running as an external question. If I make the form visible and type into it manually, I am able to see the result when I retrieve it on the server, but if I set the value with jQuery, I retrieve keys mapped to null values.

So I think my next step is to do some heavy refactoring and start sending the data as JSON instead of via hidden forms. This is the way Dmitry is passing data in the other HITs, and it seems fairly straightforward for both JavaScript and Python. It will also make data easier to parse, and will probably prevent any future URI-too-long issues, if they were to start happening again.
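As a rough picture of what the Python side of that might look like: all of a worker's edits travel as one JSON string and get unpacked with the standard `json` module. The payload layout and field names here are made up for illustration; they are not the format Dmitry's HITs actually use.

```python
import json

# Hypothetical payload: one JSON string instead of many hidden form fields.
payload = json.dumps({
    "hitId": "EXAMPLE_HIT",
    "edits": [
        {"sentence": 0, "type": "change", "old": "a", "new": "the"},
        {"sentence": 0, "type": "delete", "old": "very"},
    ],
})


def parse_edits(raw):
    """Turn the JSON string into (sentence, type, old, new) tuples."""
    data = json.loads(raw)
    return [
        # .get() returns None when an edit type has no "old" or "new" part.
        (edit["sentence"], edit["type"], edit.get("old"), edit.get("new"))
        for edit in data["edits"]
    ]


for row in parse_edits(payload):
    print(row)
```

One field holding one string also means there is a single place to check when values come back empty, instead of one hidden input per edit.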

Wednesday, July 11, 2012

Apache, SQL, Git, and other real-world problems

My work on the ESL HIT has become very practical this week, very different from my reading academic linguistics papers and organizing categories of errors. I have been working with Dmitry on posting the HIT on the MTurk worker sandbox via Amazon's "external question" protocol. The good news is that it is up and superficially functional. But it is still very much a work in progress.

Here is my newest, freshest error - and I get the feeling this one could be a somewhat serious problem. I went through an entire HIT, fixing errors and trying to emulate the turker experience, but when I hit submit I got this message:

I think MTurk typically sends results using a long URL, with all the data fields appended. Something of the form:

https://workersandbox.mturk.com/mturk/accept?hitId=2LISPW1INMHGYD3SWS1B9BFMLIW4H0&prevHitSubmitted=false&prevReward=USD0.15&hitAutoAppDelayInSeconds=604800&groupId=2FH56XBAT2D5NDXWYKUQ3JGEY7M04T&Answer_field1=Something&Answer_field2=Something&Answer_field3=Something&Answer_field4=Something&...

Our HIT is saving a lot of data, and I can see how this URL could get very long. I am hoping that there is a simple way to change the way we receive our data from MTurk, possibly as JSON or some other friendly, easy-to-parse format. I will begin investigating...
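To get a feel for how quickly a URL of that form grows, here is a small sketch that builds the query string with `urllib.parse.urlencode` and measures it. The helper name, the number of fields, and the answer values are assumptions for illustration, and I am only guessing that the data really travels this way; the actual limit depends on how the server is configured.

```python
from urllib.parse import urlencode


def submission_url_length(base_url, answers):
    """Length of a URL with every answer appended as a query parameter."""
    return len(base_url + "?" + urlencode(answers))


base = "https://workersandbox.mturk.com/mturk/accept"

# Hypothetical HIT that saves 500 short answer fields.
answers = {f"Answer_field{i}": "Something" for i in range(1, 501)}

print(submission_url_length(base, {"Answer_field1": "Something"}))
print(submission_url_length(base, answers))
```

Each extra field adds its name, its value, and two separator characters, so the length grows linearly with the number of fields the HIT saves.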

But it's worth noting that I have had some little victories this week. I have to say, it is somewhat embarrassing how little I know about this giant aspect of computer science. As I mentioned, I have no experience with web programming, client-server protocols, databases... really any of the fundamental components of the internet. I am so glad I am getting a chance to play with it now, since it'd be great to start as a grad student without still having basic SQL tutorials on my bookmarks bar.

Of course, "playing with it" means a lot of tiny, cautious tweaks to Dmitry's code, and a LOT of emails to Dmitry (who deserves credit for his endless patience). I think it's been a productive week, though. Some accomplishments:

Getting my code to run from our Apache server: Dmitry helped me rewrite some scripts to post the HIT from the server, but it took me a while to get them to run and to populate the input sentences with sentences from the database. This required running the scripts in the right order, making sure all my paths and refs were correct (I'd messed up a few of them while I was running it on my local machine), and formatting the URL correctly to actually verify that the HITs were created.
Reaching a reasonable level of comfort manipulating the database: I had been posting HITs for myself to practice, and realized that when I deleted the HITs from MTurk, they still had entries in the database. This meant that the delete-from-mturk script had to cycle through thousands of old, deleted HITs and would run for half an hour or more. I decided I needed to clear the old entries out of the DB. This actually took a while. (It didn't help that executing anything with the word "DELETE" when you aren't 100% sure what you're doing causes a small-scale panic attack every time.) I am still not able to make psql work for me, but Python saved the day (cutest little language ever).
Comfort with Git: This has been a long time coming. I am still not a pro, but I now know what I'm doing for the basic operations, and don't need to use tutorials every time I commit. I know using Git is an absolute must in the real world, and that keeping folders of old files as a method of self-version-control is not exactly sustainable. (Although this is how I was taught to do things in econ... and surprisingly, this is how Goldman Sachs does version control.)
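For the database cleanup mentioned above, the general shape of the Python approach is roughly the following. This is a sketch only: it uses the standard-library `sqlite3` module as a stand-in for our actual database, and the table name `hits`, the column `hit_id`, and the function name are all hypothetical.

```python
import sqlite3


def delete_stale_hits(conn, live_hit_ids):
    """Delete rows for HITs that no longer exist on MTurk; return how many."""
    rows = conn.execute("SELECT hit_id FROM hits").fetchall()
    stale = [(hit_id,) for (hit_id,) in rows if hit_id not in live_hit_ids]
    with conn:  # commits if the block succeeds, rolls back if it raises
        conn.executemany("DELETE FROM hits WHERE hit_id = ?", stale)
    return len(stale)


# Tiny in-memory demo with three recorded HITs, only one of which is still live.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (hit_id TEXT PRIMARY KEY)")
conn.executemany("INSERT INTO hits VALUES (?)", [("A",), ("B",), ("C",)])
conn.commit()

print(delete_stale_hits(conn, live_hit_ids={"B"}))  # 2
```

Selecting the stale ids first and printing them before running the DELETE is a cheap way to reduce the panic factor.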

Now back to working on the Apache-can't-handle-your-giant-URL error. Fingers crossed...

Thursday, July 5, 2012

July 5th - ESL HIT

I am in the process of trying to post the ESL HIT on MTurk. I was able to work with Dmitry and make good progress on connecting the HIT to the External HIT structure that he has in place, which should allow it to run off of our server and will simplify the process of posting HITs and retrieving data.

Of course, like every idea to "simplify" work, this is causing me a lot of trouble. I am not able to get the HIT to load properly, and it doesn't look like it can find any of the utility files it needs to run. While I'm in the process of figuring this out (which might take a while, since I have no experience yet with this kind of web backend work...) I thought I could just post it the low-tech way, not as an external HIT, so that I could start getting data in the meantime (I'm very impatient to get to start doing the ML part of this project). But this file is getting very large and MTurk no longer likes it when I copy and paste it; it appears to lose some code when I do so, and then the HIT either doesn't load properly or it submits early. So hopefully one of these issues can be figured out by the end of the day, and this will be posted, hopefully through our JHU server but, at the very least, on MTurk in some way, shape, or form.