After a little break while the semester was starting up, I am returning to my work on this project. My summer work ended up being a lot of infrastructure-building, more than I had originally foreseen. Chris and I decided the best thing to do at this point, before going forward, is to step back and decide exactly what questions we want to answer with all of this infrastructure, and which experiments we need to run in order to answer them.
At the highest level, the point of all of this work is to use MTurk to get cheaper translation data without taking a hit on quality. (Pun intended.) From this standpoint, the two broad questions we want to answer are:
- Does post-editing translations improve translation quality?
- Can we automate the quality control process, and save the cost of paying for poor work/QC HITs?
To answer these questions, we need a measure of translation quality that is intuitive. Following Chris and Omar's paper, it seems logical to use human judgments: we ask Turkers to rank translations, and assume that the most highly ranked translations are better. We should also have a measure that is automatic and unambiguous, which could be used in automated processes and doesn't rely on Turkers for ranking: TER (translation edit rate) is a good, easy-to-compute, and widely used metric for this. It would also be good to have a gold standard, so we can have a sense of 'how much better' one translation is than another. Again, piggybacking off of Omar, we can use his data set, which consists of both Turker translations and professional translations of the same sentences.
Then our question (1) can be made more concrete:
- Does post-editing translations improve translation quality?
  - When ranked by Turkers, are post-edited translations ranked higher than their pre-edited versions?
  - When calculated against the professional translation, is the TER for post-edited translations lower than for their pre-edited versions?
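To make the TER comparison concrete, here is a minimal Python sketch. It is a simplification and not the real metric: full TER also counts a block shift as a single edit, while this version only counts word insertions, deletions, and substitutions, so it behaves more like word error rate. The function names and the example sentences are made up for illustration and are not taken from Omar's data.

```python
def word_edit_distance(hyp_words, ref_words):
    """Levenshtein distance over word tokens (insert, delete, substitute)."""
    # prev[j] = edits needed to turn the first i-1 hypothesis words
    # into the first j reference words.
    prev = list(range(len(ref_words) + 1))
    for i, h in enumerate(hyp_words, start=1):
        curr = [i]
        for j, r in enumerate(ref_words, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(hypothesis, reference):
    """Shift-free approximation of TER: word edits / reference length."""
    hyp_words = hypothesis.lower().split()
    ref_words = reference.lower().split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(hyp_words, ref_words) / len(ref_words)


# Made-up example: one professional reference, one translation
# before and after post-editing.
reference = "the cat sat on the mat"
pre_edit = "cat sit on mat"
post_edit = "the cat sat on a mat"

pre_score = approx_ter(pre_edit, reference)
post_score = approx_ter(post_edit, reference)
print(f"pre-edit: {pre_score:.3f}  post-edit: {post_score:.3f}")
```

Lower is better here, so the second sub-question amounts to checking whether the post-edit score is lower than the pre-edit score across the data set.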
Question (2) is a little more complicated. TER is a good, automatic metric, but we obviously don't have professional translations against which to compare all our Turker translations during QC. Instead, we can use an embedded control sentence for which we have a gold standard, and hope that the TER score of a Turker's version of the control against the gold standard is a good indicator of the quality of the rest of that Turker's work. We can test this by comparing the TER scores for a Turker's control with the rankings of that Turker's other sentences, as judged by other Turkers.
Then our question (2) can be made more concrete:
- Can we automate the quality control process, and save the cost of paying for poor work/QC HITs?
  - Do Turkers who get low TER scores for their control sentences tend to have their non-control sentences ranked higher?
  - Do Turkers who get low TER scores for their control sentences tend to have lower TER scores for their non-control sentences, when calculated against the professional translations?
Luckily, these questions mostly reuse different views of the same data. I think all of the data can be collected in two HITs:
HIT 1: The ESL HIT we already have (the one that has been the star of this blog for the past three months). This HIT will need to be populated with data for which we have professional, gold-standard sentences (i.e. Omar's data). The HIT should then have five sentences: four that are Turker translations taken from Omar's data set, and one that is an artificially created control sentence (more questions about this are below). Omar's data has four translations for each sentence/professional translation. This HIT will produce a data set which has four post-edited versions for each of those four translations.
HIT 2: Workers are presented with six versions of a sentence: the professional translation, four post-edited versions of a Turker translation (each edited by a different Turker), and the pre-edited Turker translation itself. They will be asked to rank the sentences in order from best to worst. We can expect (and can use as a form of QC) that the professional translation is ranked best, with the post-edits in the middle, and hopefully the pre-edited version last. This HIT will produce a data set which has, for a given translation in Omar's data set, a relative ranking of the four Turkers who post-edited that translation.
Using the output of these two HITs, we can (hopefully) show that post-edited sentences are better than their pre-edited versions, both in terms of human judgments and in terms of TER score against the professional version. We can also, for all HITs, plot the TER score on the control sentence against the average human ranking across the other sentences, and hopefully show that average rankings improve as TER scores on controls decrease. This would suggest that using TER on a control sentence is a good automatic filter for the quality of the worker overall.
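Alongside the plot, one way to put a number on that relationship is a rank correlation between each worker's control TER and the average rank position of their other sentences. The sketch below uses Spearman's rho, which is my choice for illustration and not something settled in the plan above. The five data points are invented, and rank position 1 means "best", so the hoped-for result is a positive correlation (low control TER going with a low, i.e. good, rank position).

```python
from statistics import mean


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences of length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Invented numbers, one entry per worker.
control_ter = [0.10, 0.25, 0.40, 0.55, 0.70]
mean_rank_position = [2.0, 2.5, 2.4, 3.5, 4.0]

rho = spearman(control_ter, mean_rank_position)
print(f"Spearman rho = {rho:.2f}")
```

A rank correlation only asks whether the two quantities move together monotonically, which seems to fit the question better than assuming a linear relationship between TER and rank.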
One implementation question I still have is how to choose the control sentences to embed into the HIT. The data set I was using this summer came with Wikipedia documents attached, so we could pull control sentences from those; Omar's data set does not have this. One idea is to use the gold-standard professional translation with introduced errors as the control. I would worry that this could affect the TERs of the translations against the professional version that we want to calculate later, since the Turkers will have, in a sense, seen the professional version as they were editing the nonprofessional versions. I don't know that this would cause a problem, but it is a thought. Another option is to randomize the sentences, which removes the need to 'match context' between the controls and the rest of the HIT. The problem is that this might increase the difficulty of the task. I would be interested in Chris's or Matt's opinions on this. Or my Mom's...since I know she reads this too. All suggestions welcome. :-)