Sunday, December 30, 2012

When in Moldova, do as the Danish do

As promised, here is a breakdown of where the translators for each individual language are coming from. These are geolocations, not self-reported locations. I am hoping to do the same thing for self-reported HITs and for the misreported HITs only, but there is a little more cleaning that needs to be done to make the country names sync.

Some notes:

  • Telugu (te) and Tagalog (tl) are, logically, largely translated by people in India...
  • ...so are French (fr) and Serbo-Croatian (sh)
  • Danish (da) is almost exclusively translated by people in Moldova
  • Sundanese (su) is, of course, translated mostly by Pakistanis




Everyone wishes they spoke waray-waray


I started looking a little closer at the turkers who misreport their countries, and added the new data files to the zip here.

Overall, we had 9072 assignments in which the country was misreported, which was attributable to 145 turkers. This is fairly minuscule, since we had over 300,000 assignments total, but I have figures, so I will report it anyway.

Most of the misreporters were located in Macedonia, most of them claimed to be in the US, and most of this misreporting happened on the Waray-waray translation HITs. I am currently working on breaking it out further, to see where exactly the waray-waray decoys are and where they claim to be from, and what percentage of waray-waray HITs they account for. I know, the suspense is killer. Try to hold tight...

Where they are...
Where they say they are...
What language they are translating...
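
The tallies behind these three breakdowns can be sketched roughly as follows. This is only an illustration: the record layout and the four sample rows are made up for the example, not taken from the real data files.

```python
from collections import Counter

# Hypothetical records, one per assignment:
# (turker_id, geolocated_country, self_reported_country, hit_language)
assignments = [
    ("w1", "Macedonia", "US", "Waray-Waray"),
    ("w1", "Macedonia", "US", "Waray-Waray"),
    ("w2", "India", "India", "Telugu"),
    ("w3", "Pakistan", "US", "Sundanese"),
]

def tally_misreports(rows):
    """Summarize assignments whose self-reported country differs from the geolocation."""
    misreported = [r for r in rows if r[1] != r[2]]
    return {
        "assignments": len(misreported),
        "turkers": len({r[0] for r in misreported}),
        "where_they_are": Counter(r[1] for r in misreported),
        "where_they_say": Counter(r[2] for r in misreported),
        "language": Counter(r[3] for r in misreported),
    }

summary = tally_misreports(assignments)
```

Calling `summary["where_they_are"].most_common()` gives the ordering that a bar chart like the ones above would need.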



Saturday, December 29, 2012

Florida isn't a country

Today was an exciting day of manual data cleaning...everyone's favorite part of research. The result is this lovely bar graph, which shows the distribution of self-reported countries (top), compared to yesterday's distribution of javascript-geolocations (bottom). The conclusion, luckily, is that Turkers are good at reporting their current locations (our turkers are oriented and conscious...). 



I think these results could be more interesting when looking at HITs for specific languages- i.e. I think if we look at who is performing Icelandic translation HITs, there may be a surprising number of people in Bangalore. Hopefully I can have this analysis up tonight or tomorrow.

Things I learned today: NEVER allow open-ended responses on a survey. My favorite answers to 'which country are you currently living in?'. Can you guess these countries? (answers below)
  1. 33 
  2. rtsjrt
  3. Lives in India
  4. Florida
  5. k

Answers:
  1. Philippines
  2. India
  3. India
  4. US
  5. Greece






Friday, December 28, 2012

Tables of numbers

Happy Christmas, holidays, New Years, Decembers, Fridays, fewer-spam-emails-this-morning, or whatever you might be celebrating.

Picking up from a while back in the semester, I am looking again at Turker demographics, specifically their language abilities. In contrast to the analysis I had begun earlier, I am looking at things on a by-turker basis now. 

I generated a list of tuples of (turker_id, geo-location, self-reported language, self-reported country) which can be downloaded here. I weeded out turkers who reported no native language or more than one. This left 2652 turkers to study, less than half of all the turkers in the full dataset.

No language reported       2556
One language reported      2652
More languages reported    828

Among turkers who reported multiple languages to be their native language, some reported only a modest two, others were bold and claimed more than 5 native languages. One turker went to town, and claimed a full 15 native languages, giving the EU a run for its money. Of those claiming multiple languages, English appears in nearly every list, accounting for most of the double-native-languagers.

Num languages reported    Frequency
2                         684
3                         94
4                         23
5                         7
6                         8
7                         4
8                         1
9                         2
10                        3
15                        1


Here are distributions of the top 15 most represented countries and languages in terms of number of turkers.
15 most represented countries

15 most represented languages
As an interesting comparison to my work from earlier this semester, here is the by-country distribution in terms of number of HITs submitted (rather than number of turkers).

Most represented countries by number of HITs submitted

Also, out of curiosity, since English was the most common language while India is the most represented country, I checked the most represented countries among only English speakers:
15 most represented countries among self-reported English speakers



Wednesday, November 21, 2012

Actual Native Languages

In response to Chris's comment, here is a better table of native languages (counted in number of assignments).

Language              Assignments
Urdu                  3686
Macedonian            3604
Telugu                3589
Malayalam             3509
Romanian              3268
Tagalog               3103
Polish                3083
Marathi               2970
Spanish               2919
Hindi                 2911
Portuguese            2823
Arabic                2722
Dutch                 2594
Kannada               2526
Javanese              2521
Tamil                 2457
Asturian              2319
Vietnamese            2311
Serbian               2213
Punjabi               2197
Russian               2145
Gujarati              2139
Nepali                2131
Newar / Nepal Bhasa   2111
Indonesian            2081
French                1931
Albanian              1894
Cebuano               1797
Serbo-Croatian        1768
Hungarian             1745
Malay                 1672
Bengali               1636
Norwegian (Bokmal)    1606
Sicilian              1564
Japanese              1540
German                1519
Bishnupriya Manipuri  1417
Luxembourgish         1358
Ukrainian             1347
Italian               1333
Greek                 1331
Kapampangan           1204
Belarusian            1154
Swedish               1147
Swahili               1047
Lithuanian            984
Croatian              981
Waray-Waray           975
Bulgarian             892
Turkish               828
Norwegian (Nynorsk)   729
Central Bicolano      720
Slovenian             703
Bosnian               613
Hebrew                589
Chinese               552
Danish                535
Ilokano               527
Thai                  508
Basque                403
Afrikaans             400
Czech                 381
Finnish               380
Slovak                368
Catalan               329
Neapolitan            323
Georgian              321
Galician              299
Irish                 260
Haitian               245
Korean                241
Armenian              199
Icelandic             191
Sindhi                191
Esperanto             181
Sundanese             148
Wolof                 132
Pashto                119
Persian               115
Latvian               109
West Frisian          60
Breton                48
Somali                30
Malagasy              29
Welsh                 28
Yoruba                25
Uzbek                 25
Tibetan               20
Amharic               18
Piedmontese           5
Kurdish               2
Azerbaijani           1
Zazaki                1
Low Saxon             1
Walloon               0
Quechua               0
Aragonese             0
Tatar                 0
Kazakh                0
Ido                   0

Tuesday, November 20, 2012

Red Hot Data...

More data on the Turkers:

We unfortunately didn't ask the Turkers what their native language was, so I am using the language in which the survey was written as the best approximation. Here is a list of the number of surveys completed in each language. Not too informative, since it is more a measure of how many HITs were posted in each language.


Counts were 3750 for all languages not in the table. (i.e. Telugu Bulgarian French Tamil Haitian Nepali Sundanese Albanian Tagalog Serbian Malayalam Italian Norwegian (Nynorsk) Czech Slovak Urdu Polish Newar / Nepal Bhasa Swedish Marathi Slovenian Azerbaijani Danish Indonesian Galician Afrikaans Tibetan Central Uzbek Serbo-Croatian Punjabi Greek Kapampangan Latvian Croatian Arabic Breton Icelandic Turkish Waray-Waray Gujarati Hindi Vietnamese Hungarian Catalan Wolof Bosnian Lithuanian Malay Luxembourgish Russian Cebuano Bengali Norwegian (Bokmål) Javanese Irish Sicilian German West Frisian Belarusian Kannada Macedonian Basque Romanian Somali Ilokano Dutch Finnish Ukrainian Welsh Asturian Portuguese Spanish Esperanto)


Graphs of the Turkers' approval rates by country (top 25 countries by assignments) and by language (worst 40 languages).



For the lowest approval-rate languages, here is a table of the primary location of Turkers completing that HIT, as well as average years of English and source language.


Tuesday, November 13, 2012

Portrait of the turker as a...young turker...

It has been a very long time, but I have finally had a free afternoon to do some non-class work. I'll save the blabbing about how busy this semester is for my other blog (the one that I update even less than this one) and instead talk about some Turker data analysis.

I am looking at the data that Dmitry left in his database and pulling together some information about who our Turkers are and what we can say about their translation/language abilities. So far, I've made a few high-level graphs of Turkers' countries and reported language experience and their tendency to be approved/rejected by us. I am working on doing some more analysis to compare their native languages/countries to the languages in which they are actually performing HITs, and the corresponding approval/rejection ratios.

To start, distributions of numbers of workers by country. About half of all of our translators are in India...
Self-reported years of experience speaking English and speaking the source language (the language from which they are translating). This data was filtered to only include the people who responded with actual numbers. I had to throw out responses like 'Many', 'A lot' and my personal favorite, 'nil' (honesty always appreciated).


This is a scatter plot of Turkers' individual approval rates against their years speaking English. I was hoping it would show something more than it did, but I think the high number of Turkers who only submitted one or two HITs and got 100% approval makes the distribution somewhat unsatisfying.

As an alternative, I tried viewing the distribution of years of English separately among approved and rejected workers. The notable problem with this is that it is on a per-assignment basis, so if one worker submitted 100 assignments and had them all rejected, their years-English would count as 100 different data points in the rejected distribution. I don't think this is a drastic problem, since few workers have approval rates far below 50%, but I will try to find a cleaner way to represent it. 
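
One cleaner representation would be to collapse the assignments to one data point per worker before splitting into approved and rejected groups. A minimal sketch, assuming hypothetical rows of (turker_id, years_english, approved) and an arbitrary 50% cutoff for calling a worker "mostly approved":

```python
from collections import defaultdict

# Hypothetical per-assignment rows: (turker_id, years_english, approved)
rows = [
    ("w1", 10, True),
    ("w1", 10, False),
    ("w2", 3, False),
    ("w2", 3, False),
    ("w2", 3, False),
    ("w3", 20, True),
]

def per_worker_summary(rows, threshold=0.5):
    """Return {turker_id: (years_english, approval_rate, mostly_approved)}.

    Each worker contributes exactly one entry, no matter how many
    assignments they submitted.
    """
    counts = defaultdict(lambda: [0, 0])  # turker_id -> [approved, total]
    years = {}
    for turker_id, years_english, approved in rows:
        counts[turker_id][0] += int(approved)
        counts[turker_id][1] += 1
        years[turker_id] = years_english
    summary = {}
    for turker_id, (n_ok, n_total) in counts.items():
        rate = n_ok / n_total
        summary[turker_id] = (years[turker_id], rate, rate >= threshold)
    return summary

result = per_worker_summary(rows)
```

With this shape, worker w2's three rejections count as a single point in the rejected distribution rather than three.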



Wednesday, October 24, 2012

Our overly-bloggy controls

Chris observed that the sentences I showed in my last post seemed more like commentary than news. Looking into this, I realized I only ran the code on a subset of the sentences to save time, and it just happened that the sentences that I used to generate these examples were disproportionately taken from the newsgroup/blog sources rather than the straight BBC sources (about 3 : 1). For these sentences, the Google queries tended to return sites like bbc.co.uk/blogs.

However, there were still a handful of BBC sentences (examples below) and they, too, appear to be drawing from blog sources. It could be that the blogs just generate much more traffic than the main news, so we are much more likely to hit blogs than real news. (Who are all these people clogging up our internet with their blogs? They should be stopped.) One option is to just block the bbc.co.uk/blogs from the search completely, or block it selectively only for the BBC source sentences.

Some examples of controls trying to match sentences from the BBC news sources:

140 year old, 9 kilo lobster Freed.
George the Giant Lobster was caught in the sea two weeks earlier.
A seafood company of the city had bought him for 100 dollars.
And immediately starting using the lobster as means of their advertisement.
The stock piling of Dollars.

They were aged between 29 to 74.
Why is there decrease in exercise with increase in age?
She has done extensive research on older age and the ability to exercise with increasing age.
A team of doctors used High Frequency rays to determine the functioning of the heart.
I suppose everyone has to counter their brain's tendency to adhere to a pattern determined roughly by age?

There has been a detailed report about this in the American Medical Association Publication.
Exercise is good for health in any age.
This makes your heart stronger, decreases cholesterol levels, helps in reducing weight, helps stop diabetes and decreases the chances of Alzheimer's.
Unfortunately, a lot of people tend to exercise less with age.
Unfortunately, PB, science doesn't work that way.



Thursday, October 18, 2012

More on controls (moron controls?)

Another round of pulling controls offline. This time, I used the Google Custom Search API, which is working much better than New York Times. The restriction is that you have to specify which sites you want to search during your queries; I suppose this is a limitation if you are trying to build a bigger and better search engine by piggybacking off of Google, but since we really just wanted to search BBC anyway, it works well. The other limitation is that the free account only gives you 100 queries per day. Granted, it is only $5 per 1000 queries after that, and I don't envision needing many more than 1500, since we only do about one query per HIT. For today, I just got around it by caching my 100 results and making fake google queries. It was that or register my turtle for a Google developer account...which I should probably do anyway.

Here are some sample sentences. I got these by using the top 5 words in each group of translations (by tf*idf) and then choosing the top sentence out of the pages returned (usually around 10 or 20 links). There are still a decent number of queries that return nothing; I think this is because the 5 query terms together are too restrictive. I may try querying for 5 for each HIT, but if nothing is returned, dropping it down to 3 and then to 1. I had also taken off the lower bound on length, so there seems to be some preference for short sentences, although it is not obvious from these examples. I also think I will try taking just the top one or two documents, rather than 10 or 20, mostly for speed and because I don't imagine it will hurt quality.

  • What do you think what are the objectives of Taliban to end women education ?
  • Why Government is unable to stop these unlawful act.
  • What do you think what Government should do
  • Question is that what are the benefits for ISI and Pakistani Agencies.
  • So the rumor is planted with the new agencies.

  • For every action there is a reaction
  • You can tolerate till your own but
  • According to Democracy, Majority is the ruler of Nation and who are they picture is showing
  • The picture look like a picture of Family
  • Democracy, written communication, and philosophy were quite "rare" to them.

  • What will you going to do at Prime Minister's statement
  • Please give us your opinion
  • American President George W Bush in his last Press conference advises newly elected President that he Did many mistakes in his tenure and Barack Obama should do what he think is right
  • I as the President and according with the constitution of USA and law i did the developments, i put forward American interest on my own popularity.
  • If the Scottish National Party is paying attention to these developments, it should be encouraged, I venture to suggest, by what it may discover in so doing.

  • No one is going to say anything about the way that this historical figure is being falsely accused.
  • Some things are ignored because nothing can be done about them.
  • Sir/Madam, maybe if you read not only the history of Pakistan, but of it's neighboring four countries, or of seven other countries, then you might understand the meaning of Mr. Psycho.
  • One whose nature is wrong cannot be taught.
  • It's all poppycock. 
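
The "try 5 terms, then 3, then 1" idea mentioned above could look something like the sketch below. The `search` argument is a placeholder for whatever function actually calls the search API, and `fake_search` is a toy stand-in invented for the example so that it runs without network access.

```python
def search_with_backoff(ranked_terms, search, sizes=(5, 3, 1)):
    """Try progressively shorter queries until one returns results.

    ranked_terms: query words, best first (e.g. ordered by tf*idf).
    search: any callable taking a query string and returning a list of hits.
    Returns (number_of_terms_used, hits), or (None, []) if nothing matched.
    """
    for k in sizes:
        query = " ".join(ranked_terms[:k])
        results = search(query)
        if results:
            return k, results
    return None, []

def fake_search(query):
    """Toy stand-in: a one-page 'index' that matches only if every word is present."""
    page = "george the giant lobster was caught in the sea"
    words = set(page.split())
    return [page] if all(w in words for w in query.split()) else []

terms = ["lobster", "sea", "george", "dollars", "advertisement"]
k, hits = search_with_backoff(terms, fake_search)
```

Recording which value of k finally produced hits would also make it easy to check later whether the shorter queries give noticeably worse matches.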

Wednesday, October 17, 2012

Another Attempt at Controls

Per our previous discussion, I began working on pulling control sentences from an external source, instead of using the professional translations. Since our data is English news data, and since New York Times has a nice open API, I decided to try to pull sentences from NYT and see if we could construct decent matches. 

My plan was:
- Take the four Turker translations we are trying to match
- Calculate the top k words from those translations by tf*idf score
- Query NYT using those k words as search terms
- Pull the returned articles and break them into sentences
- Use tf*idf overlap scoring we'd used on the wikipedia controls to find the best sentence
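
The term-selection and scoring steps of this plan can be sketched as below. This is an illustration only, not the code used for the experiments: the smoothed idf formula, the tokenizer, and the tiny background corpus are all assumptions made to keep the example self-contained.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase and split into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def idf_table(documents):
    """Smoothed inverse document frequency over a list of token lists."""
    n = len(documents)
    df = Counter(term for doc in documents for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1 for term, count in df.items()}

def top_k_terms(translations, idf, k=5):
    """Rank the words in a group of translations by tf*idf and keep the top k."""
    tf = Counter(tok for t in translations for tok in tokenize(t))
    default_idf = max(idf.values(), default=1.0)  # unseen words count as rare
    scores = {term: count * idf.get(term, default_idf) for term, count in tf.items()}
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]

def best_sentence(candidates, query_terms, idf):
    """Pick the candidate with the highest summed idf over terms shared with the query."""
    def overlap(sentence):
        shared = set(tokenize(sentence)) & set(query_terms)
        return sum(idf.get(term, 0.0) for term in shared)
    return max(candidates, key=overlap)

# Made-up background corpus and inputs, just to exercise the functions.
background = [
    tokenize("the lobster was released in the sea"),
    tokenize("the heart pumps blood"),
    tokenize("the government said the plan was wrong"),
]
idf = idf_table(background)
translations = ["The lobster was caught in the sea", "The giant lobster was freed"]
terms = top_k_terms(translations, idf, k=5)
candidates = [
    "The stock piling of dollars.",
    "George the giant lobster was caught in the sea.",
]
control = best_sentence(candidates, terms, idf)
```

On a corpus this small, common words such as "the" still score highly; a larger background corpus or a stopword list would be needed for the ranking to be meaningful.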

Reasonable as it seemed, this plan didn't produce great results. First, very few articles were returned from my queries; most searches returned nothing. I'd originally set k to 5, but this was apparently too restrictive for the search. I was able to get results if I dropped k to 1, but, needless to say, that significantly hurt the matchiness of the sentences returned (see below). Second, the query searches the entire text of the articles but does not return the full text. So the sentences that are actually returned as candidate control sentences are only those from the first paragraph of the matching article. I think it would be worth looking into Google's API instead. I expect it will give better results for higher values of k, and would return full texts of articles.

Here are some example control sentences returned. These are with k = 2, so probably not even worth posting except that they are so amusing.

My personal favorite:
  • But a group for animal rights, "Pita", demanded the lobster's release and now he has been released in the sea again.
  • George was caught in a sea in Canada, after which he spent ten days in the tank of "City Crab and Seafood Company."
  • The age of a lobster can usually be determined by its weight.
  • According to the restaurant manager, Keith Wallanty, they did not mean any harm to the lobster, and put him in the tank only to arouse curiosity amongst their customers.
  • Ten minutes later, I was in the land of carrots.

  • Isn't all this terrorism too? Why does only ones pain seem painful?
  • Aren't other people human too? So why is only Muslim blood so important? Wasn't all the innocent people's blood red too?
  • So why are there no disapproval messages now?
  • I watched the Mumbai attacks investigations like the fire was in my own home. I cannot watch this human suffering. Can't everyone else see this suffering? I wish Muslims all over could come together and think about this.
  • Hacipasa, Turkey - The men stood at the road's edge and watched the war that is inching ever closer to home.

This one is a classic New York Timesy sentence. So theatrical. They can't just open with a fact like Bloomberg or WSJ:
  • In which, the ability of the heart to pump blood, and it's ability to shrink and relax was researched.
  • Experts have concentrated mainly on the left side of the heart which is mainly responsible for the pumping of blood.
  • They discovered that with increase in age, the problems in the rate of heart relaxation also increase.
  • Dr. Patricia says that with increase in age, the problems in the heart relaxation rates increases and thus effects the ability to exercise with growing age.
  • A steady rain was soaking the windows of La Guardia airport when Nancy Thode, an elite frequent flier with Delta air lines, approached a gate agent with a pressing question: had her request for an upgrade cleared?

  • Dr. Carwin Cain was also part of this research.
  • He says that it's surprising to see the permanent effects of improper exercise on the heart.
  • In some instances, blood pressure is a factor in heart anomalies and heart diseases.
  • If doctors and patients can work together when it comes to heart diseases, then they can enjoy the benefits of exercise even with older age.
  • To buy doctors lunch or dinner, for example, but tempting them with lavish gifts was taboo.


Tuesday, October 16, 2012

New Data and New HITs Followup

I regenerated the editing HITs, this time using 5 sentence groups that contain 4 contiguous Turker translations (from the same Turker and the same source document) and one professional control (chosen from the same document as the Turker's translations, but a different sentence). Both versions are currently up on the sandbox site. I am going to try finding other controls besides the professional translations, and will hopefully have a version 3 up soon. (Version threes are always the best versions.)

Examples of version 2:


Everything is possible in America
But by spite of this , I are hopeful that the dream of our founders will liveed .
According to management two million people are attending this gathering
Monday is a national holiday in America because on this day one of the greatest saviour of human rights, Martin Luther King was being assisinated
Martin Luther King was black in race

Britains Secretary of State said in front of media in Mumbai and also wrote that the idea of war against terrorism is wrong.Killing people for the sake of threats being faced had been the policy of President Bush
It is to be noted on newly electes American President Obama aslo expressed such veiws of Kashmir .
David Milliband said that Pakistan should not have any soft corner for the extremist groups
Is India creating uncertainity in the region to avoid the resolution of Kashmir issue by big powers?
Are the Mumbai attacks the part of the plan?

Pubslishing something from their own sounds far, they even din't publish the discussions.
And my question to you is do the teachings of Islam , or those of the East or West give any logic , philosophy , or reasoning that approves of the killing in innocent citizens , including people who are praying , laborers , and low-income workers , and employees , by an consequence of American attacks ?
On be half of the muslims
I have got a single question from all of the human right champions.
It is impossible for them to stand infront of you with neclaces if you kill their childeren.

Monday, October 15, 2012

New Data and New HITs

Okay...not new HITs. The same old, recycled HITs. But new data, and that is what really matters.

I reposted the postediting HIT on the sandbox site using Omar's data set, which contains four professional translations in addition to four turker translations for each sentence. The HITs are set up now to show 5 versions of the same sentence, four of which are Turker translations and one of which is a professional translation (randomly chosen) which has errors added. I think this grouping is nice, since the professional sentence tends to blend nicely and it gives the Turkers a lot of context, which makes editing easier.

I am also going to try a grouping that gives four continuous, different sentences from the document, one of which is an errorified professional translation. My instinct is that this will not allow the control to blend as well, but I will try both and see what the consensus is on which one is better ('consensus' being an unbiased poll of an iid draw from the population, i.e. the majority opinion of myself, my boyfriend, Chris, and Matt).

Here are some example groupings under the current setup (controls in red). These and the 97 other HITs are on the sandbox site, and can be found if you search for 'Ellie'.

He said that both countries have extremely different history.
He said that the histories of both countries were extremely different.
He said that both countries have entirely different history.
He saids for the two countries have different historical backgrounds . 
He said that the history of both the countries is different.

Brothers, the end result of killing innocent children is destruction.
brothers, the only end to the killer of innocent children is destruction.
Brothers, the consequence of a murderer of children is disaster.
Brothers, the fate of the murderer of innocent children is annihilation.
Brohters , an fate in the killers of innocent children was destruction . 

Seventeenth Ammendment cannot be abolished till Zardari is the President.
as long as Zardari is the president, seventeenth amendment is not going to be cancelled.
17th amendment will not be canceled until Zardari is the President.
As long as the Zardari is preisdent , a seventeenth amendment is not with to be annulling . 
As long Zardari is President 17th amendment will be there.

Tuesday, October 2, 2012

Next Steps...

After a little break as the semester has been starting up, I am returning to my work on this project. My summer work ended up being a lot of infrastructure-building, more than I had originally foreseen. Chris and I decided the best thing to do at this point, before going forward, is to step back and try to decide exactly what questions we want to answer with all of this infrastructure, and which experiments we need to run in order to answer them.

At the highest level, the point of all of this work is to use Mturk to get cheaper translation data without taking a hit on quality. (Pun intended.) From this standpoint, the two broad questions we want to answer are:
  1. Does post-editing translations improve translation quality?
  2. Can we automate the quality control process, and save the cost of paying for poor work/QC HITs?
To answer these questions, we need a good measure of translation quality that is intuitive. Following Chris and Omar's paper, it seems logical to use human judgments: we ask Turkers to rank translations, and assume that the most highly ranked translations are better. We should also have a measure that is automatic and unambiguous, which could be used in automated processes and doesn't rely on Turkers for ranking: TER is a good, easy-to-compute, and widely-used metric for this. It would also be good to have a gold standard, so we can have a sense of 'how much better' one translation is than another. Again, piggybacking off of Omar, we can use his data set which consists of both Turker translations and professional translations of the same sentences.

Then our question (1) can be made more concrete:
  • Does post-editing translations improve translation quality?
    1. When ranked by Turkers, are post-edited translations ranked higher than their pre-edited versions?
    2. When calculated against the professional translation, is the TER for post-edited translations lower than for their pre-edited versions?
Question (2) is a little more complicated. TER is a good, automatic metric, but we obviously don't have professional translations against which to compare all our Turker translations during QC. We instead can use an embedded control sentence for which we have a gold standard, and hope that the TER score of a Turker's version of the control against the gold standard is a good indicator of the quality of the rest of the Turker's work. We can test this by comparing the TER scores for a Turker's control with the rankings, as judged by other Turkers. 

Then our question (2) can be made more concrete:
  • Can we automate the quality control process, and save the cost of paying for poor work/QC HITs?
    1. Do Turkers who get low TER scores for their control sentences tend to have their non-control sentences ranked higher? 
    2. Do Turkers who get low TER scores for their control sentences tend to have lower TER scores for their non-control sentences, when calculated against the professional translations?
Luckily, these questions mostly reuse different views of the same data. I think all of the data can be collected in two HITs:

HIT 1: The ESL HIT we already have (the one that has been the star of this blog for the past three months). This HIT will need to be populated with data for which we have professional, gold standard sentences (i.e. Omar's data). The HIT should then have 5 sentences, 4 of which are Turker translations taken from Omar's data set, and one of which is an artificially created control sentence (more questions about this are below). Omar's data has four translations for each sentence/professional translation. This HIT will produce a data set which has four post-edited versions for each of those four translations.

HIT 2: Workers are presented with 6 versions of a sentence: one professional version, four post-edited versions of a Turker translation, and the pre-edited Turker translation itself. They will be asked to rank the sentences in order from best to worst. We can expect (and can use as a form of QC) that the professional translation is ranked best, with the post edits in the middle, and hopefully the pre-edited version last. This HIT will produce a data set which has, for a given translation in Omar's data set, a relative ranking of the four Turkers who post-edited that translation.

Using the output of these two HITs, we can (hopefully) show that post-edited sentences are better than their pre-edited versions, in terms of human judgments and in terms of TER score against the professional version. We can also, for all HITs, plot the TER score on control sentence against average human ranking across the other sentences, and hopefully show that average rankings increase as TER scores on controls decrease. This would suggest that using TER on a control sentence is a good automatic filter for the quality of the worker overall. 
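
To make the TER comparison concrete, here is a rough sketch of the pre-edit versus post-edit check. Note the simplification: real TER also counts block shifts as single edits, while this version uses plain word-level edit distance (insertions, deletions, substitutions) divided by the reference length, so it will overestimate TER whenever words are reordered. The example sentences are invented.

```python
def word_edit_distance(hypothesis, reference):
    """Word-level Levenshtein distance between two whitespace-tokenized strings."""
    hyp, ref = hypothesis.split(), reference.split()
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]

def approx_ter(hypothesis, reference):
    """Edits per reference word; an approximation of TER that ignores shifts."""
    ref_len = len(reference.split())
    if ref_len == 0:
        raise ValueError("reference must be non-empty")
    return word_edit_distance(hypothesis, reference) / ref_len

# Invented example: professional reference, a raw translation, and its post-edit.
reference = "I went to the store"
pre_edit = "I goed to store"
post_edit = "I went to store"
improved = approx_ter(post_edit, reference) < approx_ter(pre_edit, reference)
```

For the actual experiments an established TER implementation would be the safer choice; the sketch is only meant to show what "lower TER after post-editing" is measuring.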

One implementation question I still have is how to choose the control sentences to embed into the HIT. Whereas the data set I was using this summer came with Wikipedia documents attached, and we could pull control sentences this way, Omar's data set does not have this. One idea is to use the gold-standard professional translation with introduced errors as the control; I would worry that this could affect the TERs of the translations against the professional that we want to calculate later, since the Turkers will have, in a sense, seen the professional version as they were editing the nonprofessional versions. I don't know that this would cause a problem, but it is a thought. Another option is to randomize the sentences, which removes the need to 'match context' between the controls and the rest of the HIT. The problem is that this might increase the difficulty of the task.  I would be interested in Chris or Matt's opinions on this? Or my Mom's...since I know she reads this too. All suggestions welcome. :-)



Thursday, September 20, 2012

New Blog

I decided, for the sake of organization, to start a new blog which will focus on grad school related topics not specific to this research project. I will keep updating here with progress in the ESL HIT (coming shortly).

Follow me if you want to get spammed with information about my work and life at Penn. Don't follow me if you just found this blog from Googling an MTurk error that is probably still unsolved.

http://couldnotresolve.blogspot.com/

Sunday, September 9, 2012

Tech Report

Finally...after having attempted to do it at least once a week since June...I have written up a technical report of my work this summer. If you read my long post last month, and just found yourself starving for more details, then you are in luck. There is (always) more information that I would like to add, but this is a good walk through of the work we've done on the ESL HIT this summer, and the state of affairs as of now. Of course, as I was writing it, I realized how much of it (especially re: Quality Controls and Annotation) is about to be redone and will need to be rewritten. But I suppose having a constant stream of things to change is much better than having stalled out.

So enjoy: ESL HIT Technical Report


Sunday, September 2, 2012

MTurk ApproveRejectedAssignment HowTo

As I have mentioned previously - I had a traumatic first experience trying to QC my workers. Out of the 200 assignments I had posted, I rejected about half of them-- nearly all of which were good quality work. The responses from Turkers ranged from anger ("35 rejections?! Really!!?") to sad confusion ("I just want to know what I did wrong...") all of which made me feel terrible.

Luckily, MTurk recently added an API call to undo accidental rejections. Chris and I reworked some sample code from the Java SDK to make the call, which you can download here. It is very straightforward to run (there are instructions in the tar).

Wednesday, August 29, 2012

Finalizing ESL HIT

I am working on some final kinks in the ESL HIT with the intention of posting a big trial of about 1000 HITs tomorrow (the control sentences are compiling right now...slowly I might add...). The summer is winding up, and it's about time to stabilize this HIT and let it start running and collecting data on its own in the real world (...I can't hold its hand forever...). The goal is to have a large number of HITs posted before classes start so that it can be left alone for a while.

1. The main change I've been making is fixing up the control-grading scripts. After my last HIT's snafu* (I accidentally rejected a handful of legitimate workers- I felt like an evil robber baron, selfishly denying my workers pay, I was ready for the revolution to begin at my doorstep) I did some serious reworking of my grading algorithm. Before, I'd been doing a simple comparison in which I just compared two words at the same index:
So if the control sentence was I went to the store.
and we dropped "the" to make it I went to -- store.
and the Turker added "the" back in I went to the store.
I had been simply checking that the fourth word in the Turker's sentence was the same fourth word that was deleted from the original sentence, and giving them points if it matched. This was a far too basic way of grading. I was assuming that Turkers would make almost no extra edits, and that they would edit in the same order in which I had generated errors. In short, it was a very weak algorithm.
I switched to using the data structure I've described in an earlier post, so that I now trace each initial word to its final word. This requires a lot more code and bookkeeping, but is much more robust to varying orders of edits and extraneous changes in the sentence. 
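To make the difference concrete, here is a minimal sketch of the two grading styles. It is not my actual grading code: the word-tracing data structure from the earlier post is swapped out for Python's difflib.SequenceMatcher, and the function names and the tokenization (period as its own token) are assumptions made just for this illustration.

```python
from difflib import SequenceMatcher


def grade_by_index(original, turker, index):
    """Old approach: credit only if the word at `index` is unchanged in place."""
    return index < len(turker) and turker[index] == original[index]


def grade_by_alignment(original, turker, index):
    """Sketch of the newer idea: trace original[index] to wherever it ended up.

    SequenceMatcher is a stand-in for the real tracing structure; it aligns
    the two token lists and reports the runs of tokens that match.
    """
    matcher = SequenceMatcher(a=original, b=turker, autojunk=False)
    for block in matcher.get_matching_blocks():
        if block.a <= index < block.a + block.size:
            return True
    return False


original = "I went to the store .".split()
# The Turker restored "the" (index 3) but also added an extra word up front.
turker = "Yesterday I went to the store .".split()

index_result = grade_by_index(original, turker, 3)          # False: everything shifted
alignment_result = grade_by_alignment(original, turker, 3)  # True: "the" is traced
```

The index comparison fails as soon as the Turker makes one unrelated edit earlier in the sentence, which is exactly what bit me; the alignment version does not care where the restored word lands.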

2. Customized feedback for workers: Rather than just saying "Sorry," we decided to give Turkers helpful feedback when their results don't pass our QC. Now our response includes the control sentence, and the answer we were expecting to receive from them.



3. Reposting HITs once if they are rejected, so that we can be sure we are getting four high-quality responses for each HIT.

And, as always, sundry small changes throughout. With luck, the HIT will be up within the next 12 hours.



*This led to an adventure in the Java MTurk APIs. I will describe the process and post the code tomorrow in a highly-anticipated post: "Rejection and Regret: How to Repay Rejected Turkers and Rekindle Long Lost Data-Acquisition Relationships." Stay tuned...

Thursday, August 23, 2012

Worker Performance against Number of HITs Submitted

Following Chris's suggestion, here is a plot of Turkers' average accuracy against the number of HITs they submit. Unfortunately, our sample is very skewed - most Turkers only submitted one or two HITs, and a handful of Turkers took care of the rest of the HITs. 


Wednesday, August 22, 2012

Control-Only HIT : First Results

Results are in for our performance-only HIT, and I have some preliminary analysis. I haven't gotten a chance to run statistics on agreement or break down the results by POS, but I have some good descriptions of how Turkers are performing in terms of basic accuracy measurements.

Surprisingly, Turkers seem to make slightly more corrections than necessary. On average, Turkers make 4.59 changes per sentence, whereas there are only 4.52 errors per sentence.


This makes sense, since many Turkers end up making small word-choice corrections which are not actual errors, or misunderstanding the meaning of the sentence and having to make extra changes to compensate. For example, below, the punctuation change was a matter of taste, not necessity. (Ignore the random blue nodes...I have been lazy about fixing up my figure generating script...)



Turkers' accuracy overall is hovering somewhere around 40% (they correct about 40% of the errors that are actually present). I tried calculating the accuracy in two ways:
  • Identifying the right type of error in the right location counts as full credit (blue). I.e. if we introduced an error by changing a preposition, and the Turker changed the same preposition, they get full credit.
  • Identifying the right error in the right location gives half credit; correcting it back to the original version gives the other half credit (green). I.e. if we introduced an error by changing "of" to "in" and the Turker changes "in" to "on", they get half credit; if they change "in" to "of", they get full credit.
Intuitively, the average for the latter is slightly lower. Encouragingly, the distributions are very close.
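For anyone who wants the two calculations spelled out, here is a rough sketch. It is a simplification under stated assumptions: errors and fixes are represented as hypothetical dictionaries keyed by word position, and "right type of error" is reduced to "the Turker edited that position", which is cruder than what the real grading script tracks.

```python
def score_lenient(errors, fixes):
    """Full credit whenever the Turker edited a position where we introduced an error."""
    found = sum(1 for position in errors if position in fixes)
    return found / len(errors)


def score_strict(errors, fixes):
    """Half credit for editing the right position, the other half for restoring the original word."""
    total = 0.0
    for position, original_word in errors.items():
        if position in fixes:
            total += 0.5
            if fixes[position] == original_word:
                total += 0.5
    return total / len(errors)


# position -> the original word that our script replaced or removed
errors = {2: "of", 5: "the"}
# position -> the word the Turker ended up with at that position
fixes = {2: "on", 5: "the"}

lenient = score_lenient(errors, fixes)
strict = score_strict(errors, fixes)
```

In this toy case the Turker spotted both errors but picked a different preposition for one of them, so the lenient score is 1.0 and the strict score is 0.75.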

I liked the stricter measurement of accuracy better, so here it is isolated. (Blue in the lower figure = green in the upper figure.)

Some errors are easier to correct than others. Accuracy for insertion errors was the highest (errors requiring Turkers to remove unnecessary articles) and for deletion errors was the lowest (requiring Turkers to add in missing articles). Changing showed the most discrepancy between the different accuracy calculations (usually because changes were introduced with prepositions and picking the correct preposition leaves room for interpretation). 
Changes had the smoothest distributions - again, likely because of the high variability in choosing the new word to which to change the old word. Deletions and insertions were more bipolar, with Turkers finding either all or none of the errors. 



Monday, August 20, 2012

Creating, Managing, and Controlling External HITs

It has been a little while since my last post, mostly because I've been caught up on generating and grading the controls for our HIT. The up side is that I've had steady work, lots of code to write, and some quality bonding time with NLTK. The down side is that I have to now attempt not to blitzkrieg you with a discussion of 10 days' worth of code at a painful level of detail. As a compromise, I want to lay out a broad overview of my contribution to Dmitry's git repo, and explain how I've extended the existing code to support our ESL HIT. I suppose this can also be my tribute to Dmitry and a send-off of sorts as he goes on to bigger and better things. I will try not to become too sentimental...

The general pipeline for creating and running external HITs off of our server (also laid out in the README on git) is as follows:
  1. load data into databases 
  2. divide the data into batches (e.g. 5 sentences per batch for the ESL HIT) and generate one HIT per batch 
  3. add the HITs to Mturk
  4. wait for Turkers to do HITs, hit refresh constantly in browser and hope for more results
  5. retrieve* all HIT data from Mturk and load into a buffer in database
  6. read data from buffer into relevant database tables
  7. grade controls for each assignment
  8. approve/reject  based on Turkers' performance
This is a general framework, and describes all the HITs currently in the repo (including the ESL hit, a hit for translating Spanish tweets, and a hit for determining whether two translations have similar meanings). I will discuss the above pipeline in the specific context of the ESL HIT, and try to point out which aspects are unique to ESL and which are common between all HITs.

1. Loading Data into Database (scripts: esl.sql,  load_data_to_db.py)

There are nine tables in the ESL database:
esl_sentences    
hits
esl_assignments        
esl_hits_data              
esl_hits_results           
esl_location        
esl_workers 
esl_controls                
esl_edits      
           
The first (esl_sentences) simply stores all the sentences we are using to generate our HITs; all the hits have an analogous table containing the necessary raw data. The next six tables (hits, assignments, hits_data, hits_results, location, and workers) are also tables which appear analogously in all HITs. The hits table is the most central, and contains Mturk's HIT ids and internal database HIT ids for all the hits. This one is used to link each HIT to the data it contains, the assignments it spurred, the workers assigned to it, etc (see figure below). The final two (controls and edits) store data specific to the ESL HIT; controls holds information about the automatically-generated errors added to our control sentences and edits holds information about each atomic edit made by the Turkers.



The esl_edits table is one of the places that I have broken rank with the other HITs in the repo. While I have an esl_hits_results table, I have not actually been using it, but rather have been storing my results in esl_edits. I thought it was easier and neater to keep the data this way than to add columns to the hits_results table. I have also been ignoring the esl_location table so far (which will hold data taken from cookies and surveys about educational/language background), but I will probably start using it soon.

All the tables should be built initially by running the esl.sql script (although my script is not up to date, since I've been adding and removing columns from the command line). The load_data_to_db.py script simply populates the esl_sentences table by reading from a .csv file; all other tables are populated later in the pipeline. (Reading the .csv file and populating the table is one area that has to be tweaked for each different HIT, since all have different input data.) 
        

2. Generating HITs (script: generate_hits.py)

HITs are generated by reading data out of the esl_sentences table, batching the data into bite-sized HITs, and creating entries in both the hits and the esl_hits_data tables to reflect the existence of that HIT. The HIT does not have an Mturk HIT id when it is created (it is not assigned until it has been added to Mturk). In other words, after running either generate_hits.py script, the tables will show one row corresponding to each future HIT, but will have a blank mturk_id entry.

The ESL generate scripts are similar to the other HITs' scripts except that they contain extra code for inserting controls into the HITs. Our ESL controls are created using English sentences that depend on the non-control sentences in the HIT, so it is necessary to create the controls dynamically as the HITs are generated. The generate script takes all the sentences from the esl_sentences table, sorts them by document, splits them into batches of four, and then runs the four sentences through a separate control pipeline (subject of a future post). This control pipeline creates a control sentence, enters that control sentence into the esl_sentences table, and returns the id of the new esl_sentences entry. The generate script creates a new HIT which contains the four original sentences plus the new control.
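Roughly, the generate step looks like the sketch below. This is a toy version using an in-memory sqlite3 database, not the code in the repo: the doc_id column and the helper names are assumptions, only the handful of columns needed for the illustration are included, and make_control is a hypothetical stand-in for the real control pipeline.

```python
import sqlite3


def create_tables(conn):
    # Only the columns needed for this sketch; the real schema lives in esl.sql.
    conn.executescript("""
        CREATE TABLE esl_sentences (id INTEGER PRIMARY KEY, doc_id INTEGER, sentence TEXT);
        CREATE TABLE hits (id INTEGER PRIMARY KEY, mturk_hit_id TEXT);
        CREATE TABLE esl_hits_data (hit_id INTEGER, esl_sentence_id INTEGER);
    """)


def make_control(conn, batch_ids):
    """Hypothetical stand-in for the control pipeline: store a control, return its id."""
    cur = conn.execute(
        "INSERT INTO esl_sentences (doc_id, sentence) VALUES (NULL, ?)",
        ("placeholder control for sentences {}".format(batch_ids),),
    )
    return cur.lastrowid


def generate_hits(conn, batch_size=4):
    # Read all ids up front so freshly inserted controls are not batched themselves.
    rows = conn.execute("SELECT id FROM esl_sentences ORDER BY doc_id, id").fetchall()
    ids = [row[0] for row in rows]
    for start in range(0, len(ids), batch_size):
        batch = ids[start:start + batch_size]
        control_id = make_control(conn, batch)
        # The Mturk id stays blank until the HIT is actually added to Mturk.
        hit_id = conn.execute("INSERT INTO hits (mturk_hit_id) VALUES (NULL)").lastrowid
        conn.executemany(
            "INSERT INTO esl_hits_data (hit_id, esl_sentence_id) VALUES (?, ?)",
            [(hit_id, sentence_id) for sentence_id in batch + [control_id]],
        )


conn = sqlite3.connect(":memory:")
create_tables(conn)
conn.executemany(
    "INSERT INTO esl_sentences (doc_id, sentence) VALUES (?, ?)",
    [(i // 4, "sentence {}".format(i)) for i in range(8)],
)
generate_hits(conn)
```

With eight input sentences this leaves two rows in hits (both with a blank Mturk id) and five esl_hits_data rows per HIT: four original sentences plus one control.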

3. Adding the HITs to Mturk (script: add_esl_hits_to_mturk.py)

I left this script untouched, literally copied and pasted from the other HITs. It uses Dmitry's wrapper around the boto API to add the HITs to Mturk. It retrieves the newly assigned Mturk HIT id and fills in the empty column of the hits table in the database.

4. Waiting (or busy waiting, depending on how often you hit refresh...) (scripts: main.py, esl.tpl, ESLHIT.js)

As Turkers accept HITs, they are populated with the correct sentences (those that were determined during the generate script) using some wonderfully modular python templating libraries. The code in main.py is another area that has to be changed in order to add a new HIT. Each HIT has a .tpl template file, which is largely in html and javascript, but contains fragments of python code which can dynamically populate data based on parameters that are passed to the template. 

For example, the ESL HIT populates its 5 sentences dynamically from the esl_sentences database. The sentences are passed to the template in main.py (see line 316):


@route('/hits/esl/<language>')
@view('esl')
def esl_hit(language):
    # when the page is rendered, get the assignmentId/hitId and attach them to the displayed results
    assignmentid = request.query.assignmentId
    hitid = request.query.hitId
    ...
    # "hitid" at the end of the query stands for the value of the hitid variable above
    sql = "select es.* from esl_hits_data ehd, hits h, esl_sentences es where es.id=ehd.esl_sentence_id and h.id=ehd.hit_id and h.mturk_hit_id=hitid"
    ...
    # sentences = the rows returned from the sql query, as an array
    params = {
        "hit_type": "vocabulary-ru",
        "assignmentid": assignmentid,
        "hitid": hitid,
        "sentences": sentences,
    }
    return dict(params=params)

Then the sentences are loaded in esl.tpl from params (see line 274):

var sentences = [
%for sentence in params['sentences']:
  "{{sentence["sentence"]}}",
%end
];

New HITs being added would need to have a similar .tpl file and @route() defined in main.py.


5 and 6. Retrieving and Storing the Data from MTurk (scripts: multi_test.py, buffer_update.py)

Pulling results from MTurk is fairly straightforward. The multi_test.py script uses the same boto library mentioned before to call MTurk and retrieve all the HIT results (including all the fields from your javascript forms) as a giant JSON object and stores them in a buffer table in the database. I left this script largely unchanged. My only alteration was that I decided to enter worker information into the esl_workers table at this point, largely because it simplified the processing I had to do in buffer_update if I could be guaranteed that all workers would have an existing entry in the table.

The buffer_update.py script reads the data out of the buffer and populates the relevant results tables. This script I edited to fit the specific way I wanted my data stored.

7 and 8. Grading Controls and Paying Workers (scripts: buffer_update.py, qc.py)

Once all the data has been entered into the DB and the results are parsed and stored, we are able to review the results, pay workers for good work, and reject work that is not up to par. This post, despite my best intentions, is becoming very long, so I will describe the details of my grading in a follow-up post. I will instead focus on the difference between the ESL QC methodology and the Spanish tweets QC.

I was able to include the ESL QC grading in buffer_update.py since our controls are automatically gradable. This is a notable simplification from the tweet translation HIT, which required a separate HIT for QC, in which new Turkers viewed and gave thumbs up/thumbs down on each translation. Right now, the ESL HIT results are the end result, which makes the QC processing easier. We have discussed piping the results of this ESL HIT into a new error annotation HIT, and potentially feeding results back to the translator, which would make more use of the loops seen in the tweet translation QC.



This post was much longer than I intended; I didn't even get to discuss the control generation and grading itself, which was my intention. I'll cover these in a few sister posts...try to handle the suspense...


*P.S. Every single time I type "retrieve", I spell it "retreive." I had to stop using it in method names. And I am a native English speaker. ::sigh::

Friday, August 10, 2012

Generating Errors

I am working on programmatically inserting errors into our control sentences. Rather than using GenERRate, we decided to write our own script to do this. (I was going to fight to use GenERRate just because I think it's a cute tool and it seems pretty robust, but then Chris made an indisputable argument: GenERRate is in Java. Our scripts could be in Python. Q.E.D.)

We restricted the errors we are inserting to four relatively easy-to-automate ones:
  • Spelling - randomly switching two adjacent letters
  • Prepositions - randomly change a preposition to another preposition
  • Determiners - randomly add, delete, or change a determiner (only adding before nouns that do not already have determiners)
  • Verbs - randomly switch a verb ending/form (e.g. add 'ing' or delete 's')
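To give a flavor of how simple these transformations are, here is a small sketch of the spelling and preposition errors. It is not the actual script: the preposition list, the function names, and the rule of only misspelling words of four or more letters are all assumptions for the illustration, and the determiner and verb errors (which need POS tags) are left out.

```python
import random

# A short, assumed list; the real script would use a fuller set of prepositions.
PREPOSITIONS = ["of", "in", "on", "by", "from", "with", "to", "at"]


def swap_adjacent_letters(word, rng):
    """Spelling error: switch two adjacent letters at a random position."""
    if len(word) < 2:
        return word
    i = rng.randrange(len(word) - 1)
    # Note: if the two letters happen to be identical, the word is unchanged.
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def change_preposition(word, rng):
    """Preposition error: replace a preposition with a different one."""
    return rng.choice([p for p in PREPOSITIONS if p != word.lower()])


def insert_errors(tokens, rng):
    """Return a copy of tokens with one preposition and one spelling error, marked with <>."""
    out = list(tokens)
    prep_positions = [i for i, t in enumerate(out) if t.lower() in PREPOSITIONS]
    if prep_positions:
        i = rng.choice(prep_positions)
        out[i] = "<" + change_preposition(out[i], rng) + ">"
    # Words already marked with <> are skipped because they are no longer purely alphabetic.
    long_words = [i for i, t in enumerate(out) if t.isalpha() and len(t) >= 4]
    if long_words:
        i = rng.choice(long_words)
        out[i] = "<" + swap_adjacent_letters(out[i], rng) + ">"
    return out


tokens = "The Arabian Sea beach lines the southern coastline of Karachi .".split()
corrupted = insert_errors(tokens, random.Random(1))
```

Passing in a seeded random.Random keeps the generated errors reproducible, which matters because the grader later needs to know exactly which errors were introduced.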
A few examples of sentences with generated errors:

After the Soviet invasion of Manchuria and the atomic bombings of Hiroshima and Nagasaki in 1945, Japan agreed to an unconditional surrender on 15 August.

After the Soviet <invasino> of Manchuria and <an> atomic bombings of Hiroshima and Nagasaki <by> 1945, Japan agreed to an unconditional surrender on 15 August. 


Hideyoshi invaded Korea twice, but following defeats by Korean and Ming Chinese forces and Hideyoshi's death, Japanese troops were withdrawn in 1598.

Hideyoshi invaded Korea twice, but following defeats <from> Korean and <the> Ming Chinese <forceed> and Hideyoshi's death, Japanese troops were withdrawn in 1598. 


The village administration level is the most influential on a citizen's daily life, and handles matters of a village or neighborhood through an elected lurah or kepala desa (village chief).

The village administration level is the most influential on a citizen's daily life, and handles matters of <> <vlilage> or neighborhood <in> an elected lurah or kepala desa (village chief). 


The Arabian Sea beach lines the southern coastline of Karachi.

The Arabian Sea <beahc> lines the southern coastline <from> <a> Karachi. 


However, the Treaty of Peace with Japan was not signed until 1951, and that with Germany not until 1990.

However, <those> Treaty of Peace with Japan was not signed until 1951, and <of> with Germany not <untli> 1990. 

ESL Controls and Matching Context

In response to Chris's comment, below are some examples of control sentences, and the context in which they'd appear in the HIT, pulled using tf*idf scores calculated using three different sets of sentences for reference:

  • only the 5 other sentences in the HIT
  • the 5 other sentences in the HIT, plus a 3 sentence window on each side
  • all other translations from that document


Intermediate Period: From reorganization till the proclamation of Republic era.
Era of Reformation: It is third era because it started after Sultanat-e-Usmanias end and that is why it is not our topic of discussion.
During the regime of Mohamed Fatheh lot of improvement had occurred in education and was himself a follower of learned people.
Mohammads follower spread education to the mass level and every Sultan used to build a mosque and with that it was mandatory to establish a school.
As a result the number of religious school were increased along with the mosques.
It ended when Mehmed I emerged as the sultan and restored Ottoman power, bringing an end to the Interregnum.
The millets were the major religious groups that were allowed to establish their own communities under Ottoman rule.
In the latter part of this period there were educational and technological reforms, including the establishment of higher education institutions such as the Istanbul Technical University.

Edo period lasted from the year 1603 to the year 1868.
Ayeasu is recognized as the most successful ruler in the history of Japan.
He won several wars through treason.
Although the Emperor always used to be the symbolic head of state in Japan the real power and jurisdiction remained at the disposal of Shogun or the head of military. But Ayeasu established a system of government based on the traditions of both the monarchy and feudalism.
Like Hideyoshi he also initially kept a soft spot for the Christians but the Portuguese and Spanish traders went only towards those places where the Catholic missionaries asked them to go.
Ieyasu was appointed shogun in 1603 and established the Tokugawa shogunate at Edo (modern Tokyo).
Ieyasu was appointed shogun in 1603 and established the Tokugawa shogunate at Edo (modern Tokyo).
Japan has over 90,000 species of wildlife, including the brown bear, the Japanese macaque, the Japanese raccoon dog, and the Japanese giant salamander.

Later according to the Canada Act its name was kept as Canada and now this is the only name being used
A change that was reflected in the renaming of the national holiday from Dominion Day to Canada Day in 1982.
On 7th July 1969 according to the official language in the federal government french was given the status equal to English
From this Canadas journey of being a bilingual country started
English and French languages have equal importance in federal courts parliament and in all federal institutions.
English and French have equal status in federal courts, Parliament, and in all federal institutions.
English and French have equal status in federal courts, Parliament, and in all federal institutions.
Criminal law is solely a federal responsibility and is uniform throughout Canada.

The governmental occasion of PHP
The theme of first chapter is Jew and Christians criterion fulfilled and in place of them the foundation of Ismael (God Bless Him) as new people and their mentioning and their purification and filtration and the last pact with that God.
The second part talks about the Arab non believers and Allahs
The theme of third fourth fifth and sixth chapter is same which is the news of expression and purification and filtration.
The theme seventh and last chapter is to tell the rulers of Quraish about the day of Judgement and telling them the news of penalty and good news of Prophet Mohammed (PBUH) for the dominance of truth on the land of Arabs.
The number of verses differ from chapter to chapter.
As the Quran says, "With the truth we (God) have sent it down and with the truth it has come down.
Defiling or dismembering copies of the Quran is considered Quran desecration.


Much of the variation from using the whole document probably comes from the fact that some HITs contain sentences from two different documents, in which case the tf*idf is calculated using word frequencies over both documents.
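For the curious, the matching idea can be sketched as below. This is only my rough approximation for illustration, not the actual control pipeline: the smoothed idf formula, the cosine comparison, the bare-bones tokenizer, and the function names are all assumptions. The three variants above differ only in which sentences are passed in as the reference set.

```python
import math
from collections import Counter

STRIP_CHARS = ".,()\"'"


def tokenize(text):
    """Lowercase and strip surrounding punctuation; a stand-in for real tokenization."""
    words = (w.strip(STRIP_CHARS).lower() for w in text.split())
    return [w for w in words if w]


def tfidf(tokens, idf):
    """Weight raw term counts by idf; words unseen in the candidate pool get weight 0."""
    return {w: count * idf.get(w, 0.0) for w, count in Counter(tokens).items()}


def cosine(u, v):
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def pick_control(reference_sentences, candidates):
    """Return the candidate sentence whose tf*idf vector is closest to the reference set."""
    docs = [tokenize(c) for c in candidates]
    n = len(docs)
    df = Counter(w for doc in docs for w in set(doc))
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1.0 for w in df}
    reference_tokens = [w for s in reference_sentences for w in tokenize(s)]
    reference_vector = tfidf(reference_tokens, idf)
    scores = [cosine(reference_vector, tfidf(doc, idf)) for doc in docs]
    return candidates[max(range(n), key=scores.__getitem__)]


reference = [
    "Edo period lasted from the year 1603 to the year 1868.",
    "He won several wars through treason.",
]
candidates = [
    "Ieyasu was appointed shogun in 1603 and established the Tokugawa shogunate at Edo (modern Tokyo).",
    "Criminal law is solely a federal responsibility and is uniform throughout Canada.",
]
best = pick_control(reference, candidates)
```

Here the Ieyasu sentence wins because it shares "Edo" and "1603" with the reference sentences, while the Canadian one shares nothing. A HIT that mixes two documents would feed words from both into the reference set, which is the effect described above.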