Monday, January 21, 2013

Two pairs of graphs and a paragraph

First, a sketchy outline for a related works section of a paper. This probably would have been a good first step, before I started cranking out graphs, but better late than never. Any suggestions on other papers I should read and include are much appreciated:

Early demographic studies by Ipeirotis revealed that while the majority of Turkers were located in the US, India accounted for a strong 30% of Turkers. Follow-up research by Ross et al. suggested that the international presence on MTurk has been growing over time, with India accounting for 36% of workers at the time of the study. While there has not yet been a thorough investigation of Turkers' language abilities, Munro compiled survey responses of 2000 Turkers, revealing that four of the six most represented languages come from India (the top six being Hindi, Malayalam, Tamil, Spanish, French, and Telugu).

NLP and ML researchers have shown an increasing interest in using MTurk as a means of data collection. Snow et al. describe the success of using redundant non-expert labels to substitute for professional annotations, achieving comparable quality for much lower cost. The tasks performed by Snow et al., however, are kept simple and accessible for the average English-speaking Turker. As NLP research advances, the level of expertise required from annotators advances as well. Callison-Burch et al. report success using MTurk to build parallel corpora for Machine Translation, a task which requires Turkers to speak two languages with a high level of proficiency. As the number of international and bilingual Turkers grows, particularly Turkers speaking low-resource languages, it is natural to ask to what extent we can rely on MTurk for accurate translations, and how confidently we can screen Turkers to ensure high-quality results.
  • Callison-Burch, Chris, and Mark Dredze. "Creating speech and language data with Amazon's Mechanical Turk." Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk. Association for Computational Linguistics, 2010.
  • Downs, Julie S., et al. "Are your participants gaming the system?: screening mechanical turk workers." Proceedings of the 28th international conference on Human factors in computing systems. ACM, 2010.
  • Munro, Robert and Tily, Hal. "The Start of the Art: An Introduction to Crowdsourcing Technologies for Language and Cognition Studies." 
  • Ross, Joel, et al. "Who are the crowdworkers?: shifting demographics in mechanical turk." Proceedings of the 28th of the international conference extended abstracts on Human factors in computing systems. ACM, 2010.
  • Snow, Rion, et al. "Cheap and fast---but is it good?: evaluating non-expert annotations for natural language tasks." Proceedings of the Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, 2008.
And then, some scatter plots of quality against number of assignments submitted, by country and reported native language. In the plots on the right, the points are resized proportional to the number of Turkers contributing. (More specifically, size is the number of assignments-per-Turker, so bigger circles mean that a few Turkers were performing a lot of HITs.)
(The x axis is number of individual controls graded, so directly proportional to the number of assignments, but blown up by about an order of magnitude.) 

Each point represents a country


Each point represents a language














Wednesday, January 16, 2013

Rules are made to be spoken

I don't even know what I am doing with titles, anymore. Apparently this is my sense of humor, now. Rhymes...my standards are falling fast...


An interesting graph of the quality of results from the translation HIT for native and non-native speakers across HIT languages. Notable is the large number of languages for which self-reported non-native speakers seem to perform better than self-reported native speakers: Telugu, Tamil, Malay, Tagalog, Polish, Malayalam, Marathi, Vietnamese, Italian, Arabic, and French. My suspicion is that the native speakers' lower performance is due to the fact that non-native speakers had stronger English and so were better able to perform our task.

(The graph shows only languages for which at least 200 native and 200 non-native assignments were submitted.)
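For anyone who wants to reproduce this kind of cut, here is a minimal sketch of the filtering step. The tuple layout `(hit_language, is_native, quality)` and the function name are assumptions made for illustration, not the actual format of our data or the script that drew the graph.

```python
from collections import defaultdict

MIN_ASSIGNMENTS = 200  # threshold used for the graph above


def native_vs_nonnative(assignments, min_assignments=MIN_ASSIGNMENTS):
    """Average quality for native vs. non-native speakers, per HIT language.

    `assignments` is an iterable of (hit_language, is_native, quality)
    tuples (a hypothetical layout). Languages with fewer than
    `min_assignments` assignments in either group are dropped.
    """
    # language -> {is_native: [count, quality_sum]}
    totals = defaultdict(lambda: {True: [0, 0.0], False: [0, 0.0]})
    for language, is_native, quality in assignments:
        cell = totals[language][bool(is_native)]
        cell[0] += 1
        cell[1] += quality

    result = {}
    for language, groups in totals.items():
        n_native, sum_native = groups[True]
        n_non, sum_non = groups[False]
        if n_native >= min_assignments and n_non >= min_assignments:
            result[language] = (sum_native / n_native, sum_non / n_non)
    return result


# Toy data with a lowered threshold, just to show the shape of the output.
toy = [
    ("te", True, 1), ("te", True, 0),
    ("te", False, 1), ("te", False, 1),
    ("ru", True, 1), ("ru", False, 0),
]
print(native_vs_nonnative(toy, min_assignments=2))
```

Each surviving language maps to a `(native average, non-native average)` pair, which is what the bars in a graph like this one would show.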

Tuesday, January 15, 2013

Babel of towers

Here's a breakdown of the quality of some of our high-traffic HIT languages, in terms of the locations of the Turkers. Not surprisingly, the higher-quality submissions tend to come from countries where the HIT language is likely to be spoken. One exception is that Russians are apparently not that good at Russian. Shown are the 6 languages that had the highest number of assignments submitted...

But first! Some whining and excuses:
- I admit, a bar graph is not an ideal visualization of this; a table would probably be nicer...but laziness kicks in. I'll make it into a nice table tomorrow.
- These show any country from which at least 5 assignments were submitted, hence the large error bars on some. 
- N/A means that we do not have country information for that assignment. These should (and will eventually) be trimmed out...

Urdu


Macedonian
Telugu
Malayalam

Russian

Spanish
The same idea, but a breakdown of quality for the 6 most represented countries (by assignments submitted) in terms of HIT language. Again, no big surprises: Turkers in India do better on Indian languages than on European languages. Turkers in the US also do, surprisingly, better on Indian languages, but this is likely because those assignments are being submitted by Turkers who were not born in the US. I will rerun this analysis with reported native language (although we sadly have less data for reported native languages. I know, life is tough. We persevere).

India

US

Macedonia

Philippines

Moldova

Malaysia



Friday, January 11, 2013

Graphs of slightly less ambiguous scores


I'm just going to go ahead and do a take-two on that last post, this time with a quality metric that might actually mean something. Or at least one where we know what it is.

We had run some of the submitted translations through a second HIT, where the translations were compared against a known correct translation, and judged yes/no on whether the turker* translation and the known translation were synonyms. So the new way of reporting quality is the proportion of translations that were judged to be synonymous with a known answer. Some results are below. Note that even though I said judgements were yes/no, translations were judged on a yes/no/kinda scale, but I considered 'kinda' to be 'no.'
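In code, the metric boils down to something like the sketch below. The function name and the list-of-strings input are made up for illustration; the only thing taken from the actual setup is the yes/no/kinda scale and the decision to count 'kinda' as 'no'.

```python
from collections import Counter


def synonym_quality(judgments):
    """Proportion of translations judged synonymous with the known answer.

    `judgments` is a list of 'yes' / 'no' / 'kinda' labels (hypothetical
    input format). Only 'yes' counts as a match, so 'kinda' is
    effectively treated as 'no'.
    """
    if not judgments:
        raise ValueError("no judgments to score")
    counts = Counter(j.strip().lower() for j in judgments)
    return counts["yes"] / len(judgments)


# Made-up judgments for a handful of translations.
print(synonym_quality(["yes", "no", "kinda", "yes", "no"]))
```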

Scores are also a little (a lot) lower than one would ideally want them to be, but these tasks included a large number of technical words, which might not have been part of the average Turker's vocabulary, even if their language skills weren't bad. ('What do you mean you lived in Barcelona for 5 years and never learned the Spanish word for epipaleolithic?!?') Still, our graders were fairly liberal with their 'yes' votes. (Some good synonyms for 'star':  'STAR', 'Moon', and 'a star is a massive, luminous ball of plasma held together by gravity. the nearest star to earth is the sun, which is the source of most of the energy on earth. other stars are visible in the night sky, when they are not outshone by the sun. historically, the most prominent stars on the celestial sphere were grouped together into constellations and asterisms, and the brightest stars gained proper names. extensive catalogues of stars have been assembled by astronomers, which provide standardized star.')

Intuitively, and encouragingly, HITs in languages with high numbers of native speakers had better-than-average quality ratings.
Quality by HIT language

Turker-reported native language

Surprisingly, and not-so-encouragingly, countries with national languages for which we supplied many HITs did not consistently do better than average. That is, even countries with ample opportunity to shine on their own native languages didn't produce markedly higher quality. One possible reason is that the Turkers from these well-represented countries got so excited about our HITs in their native languages that they decided to roll the dice on some random languages like Waraywaray and Javanese. I'll poke a little more to see if I can back this up.

Quality by geolocation

Number of HITs per source language

As in the last post, there's a difference b/w misreporting turkers and honest turkers, although the difference is very small.

                 Avg.     99% Conf. Int.    n
Overall          0.281    (0.279, 0.282)    307390
Misreport        0.252    (0.244, 0.260)    10479
Correct report   0.282    (0.280, 0.283)    296911
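For reference, one common way to put a confidence interval around an average score like this is the normal approximation, sketched below. This is an assumption about method for illustration only: the function name is hypothetical, the data is made up, and it is not necessarily how the intervals in the table were computed.

```python
import math
from statistics import NormalDist


def mean_confidence_interval(scores, confidence=0.99):
    """Normal-approximation confidence interval for the mean of `scores`.

    Returns (mean, (lower, upper)). Reasonable for large samples; not
    necessarily the method behind the table above.
    """
    n = len(scores)
    if n < 2:
        raise ValueError("need at least two scores")
    mean = sum(scores) / n
    # Sample variance (n - 1 in the denominator).
    variance = sum((s - mean) ** 2 for s in scores) / (n - 1)
    # Two-sided critical value, e.g. about 2.576 for 99%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(variance / n)
    return mean, (mean - half_width, mean + half_width)


# Made-up example: 30 translations judged 'yes' (1) and 70 judged 'no' (0).
scores = [1] * 30 + [0] * 70
mean, (lower, upper) = mean_confidence_interval(scores)
print(f"{mean:.3f} ({lower:.3f}, {upper:.3f})")
```

With hundreds of thousands of assignments the intervals get very narrow, which is why even a small gap between misreporting and honest Turkers can sit outside them.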



*I keep typing 'truker' instead of 'turker.' Perhaps there should be a MechanicalTruck, to give truckers something to do during those long, boring rides. Data first, safety second.

Thursday, January 3, 2013

Graphs of mysterious scores


Some initial results on quality. Not to undersell, but I am not really sure what these charts mean, so take them with a grain of salt.

I am using a 'quality' measure which is available for most of the translations submitted for the vocabulary HIT. This is a score (either 0, 0.5, or 1) which was assigned in a translation-rating HIT. From what I recall from Dmitry, this was 0 for poor quality, 1 for good quality. I believe there was also some adjustment made for looking like machine translation. I am going to follow up with Dmitry to find out exactly what these scores mean. My main concern is that the data looks suspiciously clean (very very high agreement across raters), so I am really not sure what to make of it, or if it's worth using to draw any conclusions.

All that aside, I decided to make some graphs anyway, because what the hell. So assuming these scores mean something, I have some figures for average translation quality across a few cuts of the data. Among the sea of data, it is worth highlighting that people misreporting their location do appear to produce weaker translations.

                 Avg.     99% Conf. Int.    n
Overall          0.823    (0.821, 0.825)    124063
Misreport        0.785    (0.775, 0.794)    8449
Correct report   0.825    (0.823, 0.828)    115614

By location
By HIT language

Wednesday, January 2, 2013

January 2013 - appropriately reported by Summer 2012 Research Journal

My New Year's resolution: only work with unlimited quantities of perfect data. So not off to an excellent start, but I am doing my part by organizing the data I have. Since I cleaned up the giant mess of name-that-country that I was working with before, I reproduced some of the graphs from earlier, and reposted the data I am using to generate them. I am soon to embark on studies of the quality-control side of this data, so now seems as good a time as any to summarize the state of affairs so far.

We posted HITs in a ton of different languages


And they were picked up mostly by Turkers from India and the US, most of whom report English as their native language.

India accounts for a large proportion of Turkers translating across the Indian languages (Gujarati, Telugu, Tamil, Newar, Bengali, Punjabi, Hindi, Malayalam, Marathi, Kannada) as well as a few surprise languages (Norwegian, Kapampangan, Sicilian, and Asturian). Pakistan, Macedonia, and the Philippines took the reins on translating their respective languages.
Some Turkers get really excited about our HITs, and decide to try translating some languages that they may not exactly speak. So some of our HITs have a decent number of assignments submitted by Turkers claiming not to be in the country that JavaScript says they are...
...and instead to be in the country that conveniently speaks our HIT's language.



Luckily, these misreported assignments are attributable to just a handful of Turkers, suggesting that good quality control should be able to weed them out.
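A rough sketch of how that check can be done: flag an assignment as misreported when the geolocated country and the self-reported country disagree, then count flags per Turker. The tuple layout and function names are hypothetical, and the sketch assumes country names have already been normalized to a common spelling.

```python
from collections import Counter


def misreports_by_turker(assignments):
    """Count misreported assignments per Turker.

    `assignments` is an iterable of (turker_id, geolocated_country,
    reported_country) tuples (a hypothetical layout). Assignments with
    missing country information are skipped.
    """
    counts = Counter()
    for turker_id, geo_country, reported_country in assignments:
        if not geo_country or not reported_country:
            continue
        if geo_country.strip().lower() != reported_country.strip().lower():
            counts[turker_id] += 1
    return counts


def share_from_top(counts, k):
    """Fraction of all misreported assignments from the k worst offenders."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return sum(c for _, c in counts.most_common(k)) / total


# Made-up assignments: Turker A misreports three times, C once.
toy = [
    ("A", "IN", "RU"), ("A", "IN", "RU"), ("A", "IN", "ES"),
    ("B", "US", "US"),
    ("C", "PK", "MK"),
    ("D", None, "IN"),
]
counts = misreports_by_turker(toy)
print(counts.most_common())
print(share_from_top(counts, 1))
```

If a small `k` already accounts for most of the misreported assignments, that supports the handful-of-Turkers reading.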


Sunday, December 30, 2012

When in Moldova, do as the Danish do

As promised, here is a breakdown of where the translators for each individual language are coming from. These are geolocations, not self-reported locations. I am hoping to do the same thing for self-reported HITs and for the misreported HITs only, but there is a little more cleaning that needs to be done to make the country names sync.

Some notes:

  • Telugu (te) and Tagalog (tl) are, logically, largely translated by people in India...
  • ...so are French (fr) and Serbo-Croatian (sh)
  • Danish (da) is almost exclusively translated by people in Moldova
  • Sundanese (su) is, of course, translated mostly by Pakistanis