Wednesday, November 21, 2012

Actual Native Languages

In response to Chris's comment, here is a better table of native languages (counted by number of assignments).

Language              Assignments
Urdu                  3686
Macedonian            3604
Telugu                3589
Malayalam             3509
Romanian              3268
Tagalog               3103
Polish                3083
Marathi               2970
Spanish               2919
Hindi                 2911
Portuguese            2823
Arabic                2722
Dutch                 2594
Kannada               2526
Javanese              2521
Tamil                 2457
Asturian              2319
Vietnamese            2311
Serbian               2213
Punjabi               2197
Russian               2145
Gujarati              2139
Nepali                2131
Newar / Nepal Bhasa   2111
Indonesian            2081
French                1931
Albanian              1894
Cebuano               1797
Serbo-Croatian        1768
Hungarian             1745
Malay                 1672
Bengali               1636
Norwegian (Bokmål)    1606
Sicilian              1564
Japanese              1540
German                1519
Bishnupriya Manipuri  1417
Luxembourgish         1358
Ukrainian             1347
Italian               1333
Greek                 1331
Kapampangan           1204
Belarusian            1154
Swedish               1147
Swahili               1047
Lithuanian            984
Croatian              981
Waray-Waray           975
Bulgarian             892
Turkish               828
Norwegian (Nynorsk)   729
Central Bicolano      720
Slovenian             703
Bosnian               613
Hebrew                589
Chinese               552
Danish                535
Ilokano               527
Thai                  508
Basque                403
Afrikaans             400
Czech                 381
Finnish               380
Slovak                368
Catalan               329
Neapolitan            323
Georgian              321
Galician              299
Irish                 260
Haitian               245
Korean                241
Armenian              199
Icelandic             191
Sindhi                191
Esperanto             181
Sundanese             148
Wolof                 132
Pashto                119
Persian               115
Latvian               109
West Frisian          60
Breton                48
Somali                30
Malagasy              29
Welsh                 28
Yoruba                25
Uzbek                 25
Tibetan               20
Amharic               18
Piedmontese           5
Kurdish               2
Azerbaijani           1
Zazaki                1
Low Saxon             1
Walloon               0
Quechua               0
Aragonese             0
Tatar                 0
Kazakh                0
Ido                   0

Tuesday, November 20, 2012

Red Hot Data...

More data on the Turkers:

We unfortunately didn't ask the Turkers what their native language was, so I am using the language in which the survey was written as the best approximation. Here is a list of the number of surveys completed in each language. It is not too informative, since it is more a measure of how many HITs were posted in each language.


Counts were 3750 for all languages not in the table (i.e. Telugu Bulgarian French Tamil Haitian Nepali Sundanese Albanian Tagalog Serbian Malayalam Italian Norwegian (Nynorsk) Czech Slovak Urdu Polish Newar / Nepal Bhasa Swedish Marathi Slovenian Azerbaijani Danish Indonesian Galician Afrikaans Tibetan Central Uzbek Serbo-Croatian Punjabi Greek Kapampangan Latvian Croatian Arabic Breton Icelandic Turkish Waray-Waray Gujarati Hindi Vietnamese Hungarian Catalan Wolof Bosnian Lithuanian Malay Luxembourgish Russian Cebuano Bengali Norwegian (Bokmål) Javanese Irish Sicilian German West Frisian Belarusian Kannada Macedonian Basque Romanian Somali Ilokano Dutch Finnish Ukrainian Welsh Asturian Portuguese Spanish Esperanto).


Graphs of the Turkers' approval rates by country (top 25 countries by assignments) and by language (worst 40 languages).



For the lowest approval-rate languages, here is a table of the primary location of the Turkers completing those HITs, as well as their average years of English and of the source language.


Tuesday, November 13, 2012

Portrait of the turker as a...young turker...

It has been a very long time, but I have finally had a free afternoon to do some non-class work. I'll save the blabbing about how busy this semester is for my other blog (the one that I update even less than this one) and instead talk about some Turker data analysis.

I am looking at the data that Dmitry left in his database and pulling together some information about who our Turkers are and what we can say about their translation/language abilities. So far, I've made a few high-level graphs of Turkers' countries and reported language experience and their tendency to be approved/rejected by us. I am working on some more analysis to compare their native languages/countries to the languages in which they are actually performing HITs, and the corresponding approval/rejection ratios.

To start, here are the distributions of the number of workers by country. About half of all of our translators are in India...
Next, self-reported years of experience speaking English and speaking the source language (the language from which they are translating). This data was filtered to only include the people who responded with actual numbers. I had to throw out responses like 'Many', 'A lot' and my personal favorite, 'nil' (honesty always appreciated).
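
For the curious, the filtering step can be sketched roughly like this in Python. This is not the script I actually ran, just a minimal illustration; the function name `parse_years` and the sample responses are made up for the example.

```python
def parse_years(response):
    """Return the response as a float, or None if it is not a usable number."""
    try:
        years = float(response.strip())
    except (ValueError, AttributeError):
        # Covers free-text answers such as 'Many' or 'nil', and missing values.
        return None
    # float() also accepts 'nan' and 'inf', which are not real answers.
    if years != years or years in (float("inf"), float("-inf")) or years < 0:
        return None
    return years


# Hypothetical survey answers, including the kinds that had to be thrown out.
responses = ["5", "12.5", "Many", "A lot", "nil", " 3 "]
numeric = [y for y in map(parse_years, responses) if y is not None]
print(numeric)
```

Anything that does not parse as a non-negative number is simply dropped, so a response like '10 years' would also be lost under this sketch unless it is cleaned up first.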


This is a scatter plot of Turkers' individual approval rates against their years speaking English. I was hoping it would show something more than it did, but I think the high number of Turkers who only submitted one or two HITs and got 100% approval makes the distribution somewhat unsatisfying.
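
To make the one-or-two-HITs issue concrete, here is a small Python sketch of how per-worker approval rates can be computed from a list of assignments. The `(worker_id, approved)` record layout and the sample data are assumptions for illustration, not the real database schema.

```python
from collections import defaultdict


def approval_rates(assignments):
    """Map worker id -> (approved count, total count, approval rate)."""
    counts = defaultdict(lambda: [0, 0])  # worker -> [approved, total]
    for worker_id, approved in assignments:
        counts[worker_id][1] += 1
        if approved:
            counts[worker_id][0] += 1
    return {w: (a, n, a / n) for w, (a, n) in counts.items()}


# Hypothetical assignments: (worker id, was the assignment approved?)
assignments = [
    ("w1", True),
    ("w2", True), ("w2", False), ("w2", True), ("w2", True),
    ("w3", True), ("w3", True),
]
rates = approval_rates(assignments)

# Workers with only one or two assignments, who pile up at 100% on the plot.
few = {w for w, (_, n, _) in rates.items() if n <= 2}
print(rates, few)
```

Setting aside the workers in `few` before plotting is one possible way to see whether the remaining points show any trend; the cutoff of two assignments is arbitrary.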

As an alternative, I tried viewing the distribution of years of English separately among approved and rejected workers. The notable problem with this is that it is on a per-assignment basis, so if one worker submitted 100 assignments and had them all rejected, their years-English would count as 100 different data points in the rejected distribution. I don't think this is a drastic problem, since few workers have approval rates far below 50%, but I will try to find a cleaner way to represent it.
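
One possible cleaner representation, sketched below in Python, is to count each worker once and place them in the approved or rejected group according to their overall approval rate. This is only an idea, not something I have run on the real data; the record layout `(worker_id, years_english, approved)` and the 50% cutoff are assumptions.

```python
def years_by_outcome(assignments):
    """Split workers' years of English into mostly-approved / mostly-rejected.

    assignments: iterable of (worker_id, years_english, approved).
    Each worker contributes exactly one data point, however many
    assignments they submitted.
    """
    per_worker = {}
    for worker_id, years, approved in assignments:
        stats = per_worker.setdefault(
            worker_id, {"years": years, "approved": 0, "total": 0}
        )
        stats["total"] += 1
        stats["approved"] += int(approved)

    mostly_approved, mostly_rejected = [], []
    for stats in per_worker.values():
        rate = stats["approved"] / stats["total"]
        target = mostly_approved if rate >= 0.5 else mostly_rejected
        target.append(stats["years"])
    return mostly_approved, mostly_rejected


# Hypothetical data: worker 'a' has 100 rejected assignments, 'b' has 2 approved.
data = [("a", 10, False)] * 100 + [("b", 3, True), ("b", 3, True)]
approved_years, rejected_years = years_by_outcome(data)
print(approved_years, rejected_years)
```

With this grouping, the worker with 100 rejections shows up as a single point in the rejected distribution rather than 100 of them.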