Summer 2012 Research Journal: November 2012

Wednesday, November 21, 2012

Actual Native Languages

In response to Chris's comment, here is a better table of native languages (counted in number of assignments).

Urdu	3686
Macedonian	3604
Telugu	3589
Malayalam	3509
Romanian	3268
Tagalog	3103
Polish	3083
Marathi	2970
Spanish	2919
Hindi	2911
Portuguese	2823
Arabic	2722
Dutch	2594
Kannada	2526
Javanese	2521
Tamil	2457
Asturian	2319
Vietnamese	2311
Serbian	2213
Punjabi	2197
Russian	2145
Gujarati	2139
Nepali	2131
Newar / Nepal Bhasa	2111
Indonesian	2081
French	1931
Albanian	1894
Cebuano	1797
Serbo-Croatian	1768
Hungarian	1745
Malay	1672
Bengali	1636
Norwegian (Bokmål)	1606
Sicilian	1564
Japanese	1540
German	1519
Bishnupriya Manipuri	1417
Luxembourgish	1358
Ukrainian	1347
Italian	1333
Greek	1331
Kapampangan	1204
Belarusian	1154
Swedish	1147
Swahili	1047
Lithuanian	984
Croatian	981
Waray-Waray	975
Bulgarian	892
Turkish	828
Norwegian (Nynorsk)	729
Central Bicolano	720
Slovenian	703
Bosnian	613
Hebrew	589
Chinese	552
Danish	535
Ilokano	527
Thai	508
Basque	403
Afrikaans	400
Czech	381
Finnish	380
Slovak	368
Catalan	329
Neapolitan	323
Georgian	321
Galician	299
Irish	260
Haitian	245
Korean	241
Armenian	199
Icelandic	191
Sindhi	191
Esperanto	181
Sundanese	148
Wolof	132
Pashto	119
Persian	115
Latvian	109
West Frisian	60
Breton	48
Somali	30
Malagasy	29
Welsh	28
Yoruba	25
Uzbek	25
Tibetan	20
Amharic	18
Piedmontese	5
Kurdish	2
Azerbaijani	1
Zazaki	1
Low Saxon	1
Walloon	0
Quechua	0
Aragonese	0
Tatar	0
Kazakh	0
Ido	0

Tuesday, November 20, 2012

More data on the Turkers:

Unfortunately, we didn't ask the Turkers what their native language was, so I am using the language in which the survey was written as the best approximation. Here is a list of the number of surveys completed in each language. It is not too informative, since it is more a measure of how many HITs were posted in each language.

Counts were 3750 for all languages not in the table (i.e. Telugu Bulgarian French Tamil Haitian Nepali Sundanese Albanian Tagalog Serbian Malayalam Italian Norwegian (Nynorsk) Czech Slovak Urdu Polish Newar / Nepal Bhasa Swedish Marathi Slovenian Azerbaijani Danish Indonesian Galician Afrikaans Tibetan Central Uzbek Serbo-Croatian Punjabi Greek Kapampangan Latvian Croatian Arabic Breton Icelandic Turkish Waray-Waray Gujarati Hindi Vietnamese Hungarian Catalan Wolof Bosnian Lithuanian Malay Luxembourgish Russian Cebuano Bengali Norwegian (Bokmål) Javanese Irish Sicilian German West Frisian Belarusian Kannada Macedonian Basque Romanian Somali Ilokano Dutch Finnish Ukrainian Welsh Asturian Portuguese Spanish Esperanto).

Graphs of the Turkers' approval rates by country (top 25 countries by assignments) and by language (worst 40 languages).

For the lowest approval-rate languages, here is a table of the primary location of Turkers completing that HIT, as well as their average years of English and of the source language.

Tuesday, November 13, 2012

Portrait of the turker as a...young turker...

It has been a very long time, but I have finally had a free afternoon to do some non-class work. I'll save the blabbing about how busy this semester is for my other blog (the one that I update even less than this one) and instead talk about some Turker data analysis.

I am looking at the data that Dmitry left in his database and pulling together some information about who our Turkers are and what we can say about their translation/language abilities. So far, I've made a few high-level graphs of Turkers' countries and reported language experience and their tendency to be approved/rejected by us. I am working on some more analysis to compare their native languages/countries to the languages in which they are actually performing HITs, and the corresponding approval/rejection ratios.

To start, distributions of numbers of workers by country. About half of all of our translators are in India...

Self-reported years of experience speaking English and speaking the source language (the language from which they are translating). This data was filtered to only include the people who responded with actual numbers. I had to throw out responses like 'Many', 'A lot' and my personal favorite, 'nil' (honesty always appreciated).
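The filtering step can be sketched roughly as follows. This is a minimal illustration, assuming the survey answers arrive as free-text strings; the function name `parse_years` and the numeric sample answers are hypothetical, and only 'Many', 'A lot' and 'nil' come from the actual responses.

```python
import math

def parse_years(response):
    """Return the response as a number of years, or None if it is not numeric."""
    try:
        years = float(response.strip())
    except (ValueError, AttributeError):
        # Non-numeric text ('Many', 'nil') or a missing answer (None).
        return None
    # float() also accepts 'nan' and 'inf'; treat those and negatives as invalid.
    if not math.isfinite(years) or years < 0:
        return None
    return years

# Hypothetical sample of raw survey answers.
responses = ["10", "Many", "A lot", "nil", " 3.5 ", "25"]
numeric_years = [y for y in (parse_years(r) for r in responses) if y is not None]
```

Anything that does not parse as a finite, non-negative number is simply dropped, which matches the "actual numbers only" rule described above.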

This is a scatter plot of Turkers' individual approval rates against their years speaking English. I was hoping it would show something more than it did, but I think the high number of Turkers who only submitted one or two HITs and got 100% approval makes the distribution somewhat unsatisfying.
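One way to reduce the pile-up of one-or-two-HIT workers at 100% might be to compute each worker's approval rate and keep only workers above some minimum number of assignments. The sketch below assumes the data can be reduced to `(worker_id, approved)` pairs; the function name, the `min_assignments` parameter and the sample data are all hypothetical, not the schema of the real database.

```python
from collections import defaultdict

def approval_rates(assignments, min_assignments=1):
    """Map worker_id -> fraction of that worker's assignments that were approved.

    assignments: iterable of (worker_id, approved) pairs.
    Workers with fewer than min_assignments assignments are left out.
    """
    totals = defaultdict(int)
    approved_counts = defaultdict(int)
    for worker_id, approved in assignments:
        totals[worker_id] += 1
        if approved:
            approved_counts[worker_id] += 1
    return {
        worker_id: approved_counts[worker_id] / total
        for worker_id, total in totals.items()
        if total >= min_assignments
    }

# Hypothetical data: w1 did a single approved HIT, w2 did four with one rejection.
sample = [("w1", True), ("w2", True), ("w2", False), ("w2", True), ("w2", True)]
all_rates = approval_rates(sample)
filtered_rates = approval_rates(sample, min_assignments=3)
```

The threshold is a judgment call: raising it removes the single-HIT workers from the scatter plot, but it also throws away data.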

As an alternative, I tried viewing the distribution of years of English separately among approved and rejected workers. The notable problem with this is that it is on a per-assignment basis, so if one worker submitted 100 assignments and had them all rejected, their years-English would count as 100 different data points in the rejected distribution. I don't think this is a drastic problem, since few workers have approval rates far below 50%, but I will try to find a cleaner way to represent it.
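One possible cleaner representation, sketched here under assumptions, is to let each worker contribute their years of English at most once to the approved distribution and at most once to the rejected one, no matter how many assignments they submitted. The `(worker_id, years_english, approved)` layout, the function name and the sample data are hypothetical.

```python
def years_by_outcome(assignments):
    """Return (approved_years, rejected_years) with one entry per worker per group.

    assignments: iterable of (worker_id, years_english, approved) triples.
    A worker with both approved and rejected assignments appears once in each list.
    """
    seen = {True: {}, False: {}}
    for worker_id, years, approved in assignments:
        # setdefault keeps only the first value recorded for a worker in a group.
        seen[bool(approved)].setdefault(worker_id, years)
    return sorted(seen[True].values()), sorted(seen[False].values())

# Hypothetical data: w1 had 100 assignments rejected but should count only once.
sample = [("w1", 5, False)] * 100 + [
    ("w2", 12, True),
    ("w2", 12, False),
    ("w3", 20, True),
]
approved_years, rejected_years = years_by_outcome(sample)
```

This swaps one bias for another: a worker with one rejection out of a hundred assignments counts as much in the rejected group as a worker who was rejected every time, so it is a starting point rather than a final answer.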
