| Urdu | 3686 |
| Macedonian | 3604 |
| Telugu | 3589 |
| Malayalam | 3509 |
| Romanian | 3268 |
| Tagalog | 3103 |
| Polish | 3083 |
| Marathi | 2970 |
| Spanish | 2919 |
| Hindi | 2911 |
| Portuguese | 2823 |
| Arabic | 2722 |
| Dutch | 2594 |
| Kannada | 2526 |
| Javanese | 2521 |
| Tamil | 2457 |
| Asturian | 2319 |
| Vietnamese | 2311 |
| Serbian | 2213 |
| Punjabi | 2197 |
| Russian | 2145 |
| Gujarati | 2139 |
| Nepali | 2131 |
| Newar / Nepal Bhasa | 2111 |
| Indonesian | 2081 |
| French | 1931 |
| Albanian | 1894 |
| Cebuano | 1797 |
| Serbo-Croatian | 1768 |
| Hungarian | 1745 |
| Malay | 1672 |
| Bengali | 1636 |
| Norwegian (Bokmål) | 1606 |
| Sicilian | 1564 |
| Japanese | 1540 |
| German | 1519 |
| Bishnupriya Manipuri | 1417 |
| Luxembourgish | 1358 |
| Ukrainian | 1347 |
| Italian | 1333 |
| Greek | 1331 |
| Kapampangan | 1204 |
| Belarusian | 1154 |
| Swedish | 1147 |
| Swahili | 1047 |
| Lithuanian | 984 |
| Croatian | 981 |
| Waray-Waray | 975 |
| Bulgarian | 892 |
| Turkish | 828 |
| Norwegian (Nynorsk) | 729 |
| Central Bicolano | 720 |
| Slovenian | 703 |
| Bosnian | 613 |
| Hebrew | 589 |
| Chinese | 552 |
| Danish | 535 |
| Ilokano | 527 |
| Thai | 508 |
| Basque | 403 |
| Afrikaans | 400 |
| Czech | 381 |
| Finnish | 380 |
| Slovak | 368 |
| Catalan | 329 |
| Neapolitan | 323 |
| Georgian | 321 |
| Galician | 299 |
| Irish | 260 |
| Haitian | 245 |
| Korean | 241 |
| Armenian | 199 |
| Icelandic | 191 |
| Sindhi | 191 |
| Esperanto | 181 |
| Sundanese | 148 |
| Wolof | 132 |
| Pashto | 119 |
| Persian | 115 |
| Latvian | 109 |
| West Frisian | 60 |
| Breton | 48 |
| Somali | 30 |
| Malagasy | 29 |
| Welsh | 28 |
| Yoruba | 25 |
| Uzbek | 25 |
| Tibetan | 20 |
| Amharic | 18 |
| Piedmontese | 5 |
| Kurdish | 2 |
| Azerbaijani | 1 |
| Zazaki | 1 |
| Low Saxon | 1 |
| Walloon | 0 |
| Quechua | 0 |
| Aragonese | 0 |
| Tatar | 0 |
| Kazakh | 0 |
| Ido | 0 |
Wednesday, November 21, 2012
Actual Native Languages
In response to Chris's comment, here is a better table of native languages (counted in number of assignments).
Tuesday, November 20, 2012
Red Hot Data...
More data on the Turkers:
We unfortunately didn't ask the Turkers what their native language was, so I am using the language in which the survey was written as the best approximation. Here is a list of the number of surveys completed in each language. It is not too informative, since it is more a measure of how many HITs were posed in each language.
Counts were 3750 for all languages not in the table. (i.e. Telugu Bulgarian French Tamil Haitian Nepali Sundanese Albanian Tagalog Serbian Malayalam Italian Norwegian (Nynorsk) Czech Slovak Urdu Polish Newar / Nepal Bhasa Swedish Marathi Slovenian Azerbaijani Danish Indonesian Galician Afrikaans Tibetan Central Uzbek Serbo-Croatian Punjabi Greek Kapampangan Latvian Croatian Arabic Breton Icelandic Turkish Waray-Waray Gujarati Hindi Vietnamese Hungarian Catalan Wolof Bosnian Lithuanian Malay Luxembourgish Russian Cebuano Bengali Norwegian (Bokmål) Javanese Irish Sicilian German West Frisian Belarusian Kannada Macedonian Basque Romanian Somali Ilokano Dutch Finnish Ukrainian Welsh Asturian Portuguese Spanish Esperanto)
Graphs of the Turkers' approval rates by country (top 25 countries by assignments) and by language (worst 40 languages).
For the lowest approval-rate languages, here is a table of the primary location of Turkers completing that HIT, as well as their average years of English and of the source language.
Tuesday, November 13, 2012
Portrait of the turker as a...young turker...
It has been a very long time, but I have finally had a free afternoon to do some non-class work. I'll save the blabbing about how busy this semester is for my other blog (the one that I update even less than this one) and instead talk about some Turker data analysis.
This is a scatter plot of Turkers' individual approval rates against their years speaking English. I was hoping it would show something more than it did, but I think the high number of Turkers who only submitted one or two HITs and got 100% approval makes the distribution somewhat unsatisfying.
I am looking at the data that Dmitry left in his database and pulling together some information about who our Turkers are and what we can say about their translation/language abilities. So far, I've made a few high-level graphs of Turkers' countries and reported language experience and their tendency to be approved/rejected by us. I am working on some more analysis to compare their native languages/countries to the languages in which they are actually performing HITs, and the corresponding approval/rejection ratios.
To start, here are the distributions of the number of workers by country. About half of all of our translators are in India...
Self-reported years of experience speaking English and speaking the source language (the language from which they are translating). This data was filtered to only include the people who responded with actual numbers. I had to throw out responses like 'Many', 'A lot' and my personal favorite, 'nil' (honesty always appreciated).
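For anyone who wants to repeat the filtering, here is a minimal sketch of how the numeric-only filter might look, assuming the survey answers arrive as raw strings. The names `parse_years` and `filter_numeric` are hypothetical, and this is not the script actually used for the graphs.

```python
import math


def parse_years(response):
    """Return the response as a non-negative float, or None if it is not a plain number."""
    try:
        years = float(response.strip())
    except (ValueError, AttributeError):
        # ValueError: text such as 'Many' or 'nil'; AttributeError: missing (None) answers.
        return None
    # float() also accepts 'nan' and 'inf', which are not usable year counts.
    if not math.isfinite(years) or years < 0:
        return None
    return years


def filter_numeric(responses):
    """Keep only the responses that parse as actual numbers of years."""
    parsed = (parse_years(r) for r in responses)
    return [y for y in parsed if y is not None]


if __name__ == "__main__":
    print(filter_numeric(["12", "Many", "A lot", "nil", " 3.5 "]))
```

Answers like '10 years' would also be dropped by this sketch, so a real pass over the data might want a looser rule.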
As an alternative, I tried viewing the distribution of years of English separately among approved and rejected workers. The notable problem with this is that it is on a per-assignment basis, so if one worker submitted 100 assignments and had them all rejected, their years-English would count as 100 different data points in the rejected distribution. I don't think this is a drastic problem, since few workers have approval rates far below 50%, but I will try to find a cleaner way to represent it.
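One possible cleaner representation is to collapse the assignments down to a single record per worker before plotting, so that a prolific worker counts once. The sketch below assumes each assignment is a `(worker_id, years_english, approved)` tuple; that layout and the function name `per_worker_summary` are assumptions for illustration, not the actual database schema.

```python
from collections import defaultdict


def per_worker_summary(assignments):
    """Collapse (worker_id, years_english, approved) tuples to one record per worker."""
    # For each worker: [approved count, total count, years of English].
    totals = defaultdict(lambda: [0, 0, None])
    for worker_id, years_english, approved in assignments:
        record = totals[worker_id]
        if approved:
            record[0] += 1
        record[1] += 1
        # Assumes a worker reports the same value on every assignment; keeps the last one seen.
        record[2] = years_english
    return {
        worker_id: {
            "years_english": years,
            "approval_rate": n_approved / n_total,
            "n_assignments": n_total,
        }
        for worker_id, (n_approved, n_total, years) in totals.items()
    }


if __name__ == "__main__":
    data = [("w1", 10, False)] * 100 + [("w2", 5, True), ("w2", 5, False)]
    for worker, summary in per_worker_summary(data).items():
        print(worker, summary)
```

With this shape, the worker with 100 rejected assignments contributes one point instead of 100, and `n_assignments` is still available if the points should be weighted or filtered later.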