Friday, January 11, 2013

Graphs of slightly less ambiguous scores


I'm just going to go ahead and do a take-two on that last post, this time with a quality metric that might actually mean something. Or at least one where we know what it is.

We had run some of the submitted translations through a second HIT, where the translations were compared against a known correct translation, and judged yes/no on whether the turker* translation and the known translation were synonyms. So the new way of reporting quality is the proportion of translations that were judged to be synonymous with a known answer. Some results are below. Note that even though I said judgements were yes/no, translations were judged on a yes/no/kinda scale, but I considered 'kinda' to be 'no.'
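For concreteness, here is a minimal sketch of that metric in Python. The function name and the label strings ('yes', 'no', 'kinda') are my own stand-ins for illustration, not the actual field values from the HIT output, and the example labels are made up.

```python
from collections import Counter


def synonym_rate(judgements):
    """Proportion of judgements that are 'yes'.

    'kinda' is deliberately treated the same as 'no', so only an
    outright 'yes' counts toward the score.
    """
    if not judgements:
        raise ValueError("need at least one judgement")
    # Normalize case and whitespace so 'Yes ' and 'yes' count the same.
    counts = Counter(j.strip().lower() for j in judgements)
    return counts["yes"] / len(judgements)


# Made-up labels, purely for illustration: 2 of the 5 are 'yes'.
example = ["yes", "no", "kinda", "yes", "no"]
rate = synonym_rate(example)
```

Counting 'kinda' as 'no' makes this the stricter of the two obvious choices, so the scores below are, if anything, on the low side.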

Scores are also a little (a lot) lower than one would ideally want them to be, but these tasks included a large number of technical words, which might not have been part of the average Turker's vocabulary, even if their language skills weren't bad. ('What do you mean you lived in Barcelona for 5 years and never learned the Spanish word for epipaleolithic?!?') Still, our graders were fairly liberal with their 'yes' votes. (Some good synonyms for 'star':  'STAR', 'Moon', and 'a star is a massive, luminous ball of plasma held together by gravity. the nearest star to earth is the sun, which is the source of most of the energy on earth. other stars are visible in the night sky, when they are not outshone by the sun. historically, the most prominent stars on the celestial sphere were grouped together into constellations and asterisms, and the brightest stars gained proper names. extensive catalogues of stars have been assembled by astronomers, which provide standardized star.')

Intuitively, and encouragingly, HITs in languages with high numbers of native speakers had better-than-average quality ratings.
Quality by HIT language

Turker-reported native language

Surprisingly, and not-so-encouragingly, countries with national languages for which we supplied many HITs did not consistently do better than average. That is, even countries with ample opportunity to shine on their own native languages didn't produce markedly higher quality. One possible reason is that the Turkers from these well-represented countries got so excited about our HITs in their native languages that they decided to roll the dice on some random languages like Waraywaray and Javanese. I'll poke a little more to see if I can back this up.

Quality by geolocation

Number of HITs per source language

As in the last post, there's a difference between misreporting turkers and honest turkers, although the difference is very small.

Group            Avg.    99% Conf. Int.    n
Overall          0.281   (0.279, 0.282)    307390
Misreport        0.252   (0.244, 0.260)    10479
Correct report   0.282   (0.280, 0.283)    296911
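For anyone who wants to play with intervals like these, here is a rough sketch of one common way to put a confidence interval around a proportion: the normal (Wald) approximation. I'm not claiming this is the exact calculation behind the numbers above, and it may not reproduce them to the last digit. The function name and its `level` parameter are just illustrative.

```python
import math
from statistics import NormalDist


def proportion_ci(p_hat, n, level=0.99):
    """Normal-approximation (Wald) interval for a proportion.

    p_hat: observed proportion of 'yes' judgements
    n:     number of judgements behind that proportion
    level: two-sided confidence level, e.g. 0.99
    """
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value, roughly 2.576 for a 99% interval.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n)
    # Clamp to [0, 1], since a proportion can't leave that range.
    return max(0.0, p_hat - half_width), min(1.0, p_hat + half_width)


# Using the misreport row's average and n as inputs.
low, high = proportion_ci(0.252, 10479)
```

The approximation is reasonable here because every group has thousands of judgements; with small counts, something like a Wilson interval would be a safer bet.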



*I keep typing 'truker' instead of 'turker.' Perhaps there should be a MechanicalTruck, to give truckers something to do during those long, boring rides. Data first, safety second.
