Sunday, December 30, 2012

When in Moldova, do as the Danish do

As promised, here is a breakdown of where the translators for each individual language are coming from. These are geolocations, not self-reported locations. I am hoping to do the same thing for self-reported HITs and for the misreported HITs only, but there is a little more cleaning that needs to be done to make the country names sync.

Some notes:

  • Telugu (te) and Tagalog (tl) are, logically, largely translated by people in India...
  • ...so are French (fr) and Serbo-Croatian (sh)
  • Danish (da) is almost exclusively translated by people in Moldova
  • Sundanese (su) is, of course, translated mostly by Pakistanis
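For anyone who wants to reproduce this kind of breakdown, here is a minimal sketch of the tally. The rows are made-up toy data and the `(language_code, geolocated_country)` layout is my assumption for illustration, not the actual format of the data files.

```python
from collections import Counter, defaultdict

# Toy, made-up rows for illustration only: (language_code, geolocated_country).
assignments = [
    ("te", "India"), ("te", "India"), ("te", "United States"),
    ("da", "Moldova"), ("da", "Moldova"), ("da", "Denmark"),
]

def countries_by_language(rows):
    """Count geolocated countries separately for each language code."""
    breakdown = defaultdict(Counter)
    for language, country in rows:
        breakdown[language][country] += 1
    return breakdown

breakdown = countries_by_language(assignments)
for language, counts in sorted(breakdown.items()):
    top_country, n = counts.most_common(1)[0]
    print(f"{language}: {top_country} ({n} of {sum(counts.values())})")
```

Keeping one `Counter` per language makes it easy to ask for the top country or the full distribution later.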




Everyone wishes they spoke Waray-waray


I started looking a little closer at the turkers who misreport their countries, and added the new data files to the zip here.

Overall, we had 9072 assignments in which the country was misreported, attributable to 145 turkers. This is fairly minuscule, since we had over 300,000 assignments total, but I have figures, so I will report it anyway.

Most of the misreporters were located in Macedonia, most of them claimed to be in the US, and most of this misreporting happened on the Waray-waray translation HITs. I am currently working on breaking it out further, to see where exactly the Waray-waray decoys are, where they claim to be from, and what percentage of Waray-waray HITs they account for. I know, the suspense is killer. Try to hold tight...

Where they are...
Where they say they are...
What language they are translating...
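A rough sketch of how the misreport check can be expressed, assuming each row carries a turker id, a geolocated country, and a self-reported country (that layout and the toy rows are assumptions for illustration). The last lines redo the arithmetic on the figures quoted above: 9072 out of more than 300,000 assignments works out to a bit under 3%.

```python
from collections import Counter

# Toy, made-up rows: (turker_id, geolocated_country, self_reported_country).
rows = [
    ("A1", "Macedonia", "United States"),
    ("A1", "Macedonia", "United States"),
    ("B2", "India", "India"),
    ("C3", "Macedonia", "Macedonia"),
]

def find_misreports(rows):
    """Keep rows whose self-reported country differs from the geolocation."""
    return [row for row in rows if row[1] != row[2]]

misreports = find_misreports(rows)
misreporting_turkers = {turker_id for turker_id, _, _ in misreports}
actual_locations = Counter(geo for _, geo, _ in misreports)
claimed_locations = Counter(claimed for _, _, claimed in misreports)

# Figures quoted in the post; 300,000 is a lower bound on the total,
# so this share is an upper bound.
share_upper_bound = 9072 / 300_000
print(f"{len(misreports)} misreported rows from {len(misreporting_turkers)} turker(s)")
print(f"share of all assignments: under {share_upper_bound:.1%}")
```

A real comparison would need the country names cleaned to a common spelling first, otherwise "US" versus "United States" would count as a misreport.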



Saturday, December 29, 2012

Florida isn't a country

Today was an exciting day of manual data cleaning... everyone's favorite part of research. The result is this lovely bar graph, which shows the distribution of self-reported countries (top) compared to yesterday's distribution of JavaScript geolocations (bottom). The conclusion, luckily, is that Turkers are good at reporting their current locations (our turkers are oriented and conscious...).



I think these results could be more interesting when looking at HITs for specific languages. For example, if we look at who is performing Icelandic translation HITs, there may be a surprising number of people in Bangalore. Hopefully I can have this analysis up tonight or tomorrow.

Things I learned today: NEVER allow open-ended responses on a survey. Here are my favorite answers to 'which country are you currently living in?'. Can you guess these countries? (answers below)
  1. 33 
  2. rtsjrt
  3. Lives in India
  4. Florida
  5. k

Answers:
  1. Philippines
  2. India
  3. India
  4. US
  5. Greece
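I did this cleaning by hand, but a hand-built lookup table is one way to make it repeatable. This is only a sketch: the two `MANUAL_FIXES` entries mirror examples from the list above, while the function name, the set of known countries, and the fallback behavior are all assumptions for illustration.

```python
# Hand-built fixes; the two entries mirror examples from the post.
# Everything else in this sketch is an assumption.
MANUAL_FIXES = {
    "florida": "US",
    "lives in india": "India",
}

# Small illustrative set of canonical names (hypothetical, not the full list).
KNOWN_COUNTRIES = {"US", "India", "Philippines", "Greece"}

def normalize_country(raw):
    """Return a canonical country name, or None if a human needs to look at it."""
    cleaned = raw.strip().lower()
    for country in KNOWN_COUNTRIES:
        if cleaned == country.lower():
            return country
    return MANUAL_FIXES.get(cleaned)

answers = ["Florida", " india ", "Lives in India", "rtsjrt", "33"]
normalized = [normalize_country(a) for a in answers]
needs_review = [a for a, n in zip(answers, normalized) if n is None]
print(normalized)
print("still needs a human:", needs_review)
```

Answers like "rtsjrt" and "33" cannot be resolved from the text alone, so the sketch deliberately leaves them for manual review instead of guessing.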






Friday, December 28, 2012

Tables of numbers

Happy Christmas, holidays, New Years, Decembers, Fridays, fewer-spam-emails-this-morning, or whatever you might be celebrating.

Picking up from a while back in the semester, I am looking again at Turker demographics, specifically their language abilities. In contrast to the analysis I had begun earlier, I am looking at things on a by-turker basis now. 

I generated a list of tuples of (turker_id, geo-location, self-reported language, self-reported country) which can be downloaded here. I set aside turkers who reported no native language or more than one. This left 2652 turkers to study, fewer than half of all the turkers in the full dataset.

No language reported       2556
One language reported      2652
More languages reported     828
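Here is a small sketch of the bucketing, plus the arithmetic behind "fewer than half". The per-turker records are made-up toy data and the dictionary layout is an assumption; only the three counts at the bottom come from the numbers above.

```python
from collections import Counter

# Toy, made-up records: turker_id -> list of self-reported native languages.
reported = {
    "t1": [],
    "t2": ["Hindi"],
    "t3": ["English", "Tamil"],
    "t4": ["Telugu"],
}

def bucket(languages):
    """Classify a turker by how many native languages they reported."""
    if not languages:
        return "none"
    return "one" if len(languages) == 1 else "more"

buckets = Counter(bucket(langs) for langs in reported.values())

# Keep only turkers with exactly one reported language for further study.
single_language = {t: langs[0] for t, langs in reported.items() if len(langs) == 1}

# Counts from the post.
counts = {"none": 2556, "one": 2652, "more": 828}
total = sum(counts.values())
share_one = counts["one"] / total
print(buckets)
print(f"{counts['one']} of {total} turkers ({share_one:.1%}) reported exactly one language")
```

With these counts the total is 6036 turkers, so the single-language group is roughly 44% of the dataset.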

Among turkers who reported multiple languages to be their native language, some reported only a modest two, others were bold and claimed more than 5 native languages. One turker went to town, and claimed a full 15 native languages, giving the EU a run for its money. Of those claiming multiple languages, English appears in nearly every list, accounting for most of the double-native-languagers.

Num languages reported    Frequency
 2                        684
 3                         94
 4                         23
 5                          7
 6                          8
 7                          4
 8                          1
 9                          2
10                          3
15                          1


Here are distributions of the top 15 most represented countries and languages in terms of number of turkers.
15 most represented countries

15 most represented languages

As an interesting comparison to my work from earlier this semester, here is the by-country distribution in terms of number of HITs submitted (rather than number of turkers).

Most represented countries by number of HITs submitted

Also, out of curiosity, since English was the most common language while India is the most represented country, I checked the most represented countries among only English speakers:
15 most represented countries among self-reported English speakers