Wednesday, October 24, 2012

Our overly-bloggy controls

Chris observed that the sentences I showed in my last post seemed more like commentary than news. Looking into this, I realized I had only run the code on a subset of the sentences to save time, and it just happened that the sentences I used to generate these examples were disproportionately taken from the newsgroup/blog sources rather than the straight BBC sources (about 3 : 1). For these sentences, the Google queries tended to return sites like bbc.co.uk/blogs.

However, there were still a handful of BBC sentences (examples below) and they, too, appear to be drawing from blog sources. It could be that the blogs just generate much more traffic than the main news, so we are much more likely to hit blogs than real news. (Who are all these people clogging up our internet with their blogs? They should be stopped.) One option is to block bbc.co.uk/blogs from the search completely, or to block it selectively only for the BBC source sentences.

Some examples of controls trying to match sentences from the BBC news sources:

140 year old, 9 kilo lobster Freed.
George the Giant Lobster was caught in the sea two weeks earlier.
A seafood company of the city had bought him for 100 dollars.
And immediately starting using the lobster as means of their advertisement.
The stock piling of Dollars.

They were aged between 29 to 74.
Why is there decrease in exercise with increase in age?
She has done extensive research on older age and the ability to exercise with increasing age.
A team of doctors used High Frequency rays to determine the functioning of the heart.
I suppose everyone has to counter their brain's tendency to adhere to a pattern determined roughly by age?

There has been a detailed report about this in the American Medical Association Publication.
Exercise is good for health in any age.
This makes your heart stronger, decreases cholesterol levels, helps in reducing weight, helps stop diabetes and decreases the chances of Alzheimer's.
Unfortunately, a lot of people tend to exercise less with age.
Unfortunately, PB, science doesn't work that way.



Thursday, October 18, 2012

More on controls (moron controls?)

Another round of pulling controls offline. This time, I used the Google Custom Search API, which is working much better than the New York Times API. The restriction is that you have to specify which sites you want to search during your queries; I suppose this is a limitation if you are trying to build a bigger and better search engine by piggybacking off of Google, but since we really just wanted to search BBC anyway, it works well. The other limitation is that the free account only gives you 100 queries per day. Granted, it is only $5 per 1000 queries after that, and I don't envision needing many more than 1500, since we only do about one query per HIT. For today, I just got around it by caching my 100 results and making fake Google queries. It was that or register my turtle for a Google developer account...which I should probably do anyway.
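
The caching workaround could look roughly like the sketch below. This is an illustration rather than the code I actually ran: `search_fn` is a hypothetical stand-in for whatever function really calls the search API, and the cache file name is made up.

```python
import json
import os

CACHE_PATH = "query_cache.json"  # hypothetical file name


def load_cache(path=CACHE_PATH):
    """Load previously saved query results, or start with an empty cache."""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as f:
            return json.load(f)
    return {}


def cached_search(query, search_fn, cache, path=None):
    """Return results for `query`, calling `search_fn` only on a cache miss.

    `search_fn` is an assumed callable (query string -> JSON-serializable list).
    If `path` is given, the cache is written back to disk after each miss.
    """
    if query not in cache:
        cache[query] = search_fn(query)
        if path is not None:
            with open(path, "w", encoding="utf-8") as f:
                json.dump(cache, f)
    return cache[query]
```

With this in place, repeated runs over the same HITs only spend quota on queries that have not been seen before.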

Here are some sample sentences. I got these by using the top 5 words in each group of translations (by tf*idf) and then choosing the top sentence out of the pages returned (usually around 10 or 20 links). There are still a decent number of queries that return nothing; I think this is because the 5 query terms together are too restrictive. I may try querying for 5 for each HIT, but if nothing is returned, dropping it down to 3 and then to 1. I had also taken off the lower bound on length, so there seems to be some preference for short sentences, although it is not obvious from these examples. I also think I will try taking just the top one or two documents, rather than 10 or 20, mostly for speedup and because I don't imagine it will hurt quality.
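
A minimal sketch of the 5, then 3, then 1 back-off idea, assuming the terms are already sorted by tf*idf and that `search_fn` (a hypothetical name) takes a query string and returns a possibly empty list of results:

```python
def search_with_backoff(ranked_terms, search_fn, sizes=(5, 3, 1)):
    """Query with the top 5 terms, then the top 3, then the top 1.

    Returns (number_of_terms_used, results), or (0, []) if every
    attempt comes back empty.
    """
    tried = set()
    for k in sizes:
        terms = tuple(ranked_terms[:k])
        # Skip empty queries and queries identical to one already tried
        # (this happens when fewer than k terms are available).
        if not terms or terms in tried:
            continue
        tried.add(terms)
        results = search_fn(" ".join(terms))
        if results:
            return len(terms), results
    return 0, []
```

Returning the number of terms that finally worked makes it easy to check later whether the one-word queries are the ones producing the poorly matched controls.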

  • What do you think what are the objectives of Taliban to end women education ?
  • Why Government is unable to stop these unlawful act.
  • What do you think what Government should do
  • Question is that what are the benefits for ISI and Pakistani Agencies.
  • So the rumor is planted with the new agencies.

  • For every action there is a reaction
  • You can tolerate till your own but
  • According to Democracy, Majority is the ruler of Nation and who are they picture is showing
  • The picture look like a picture of Family
  • Democracy, written communication, and philosophy were quite "rare" to them.

  • What will you going to do at Prime Minister's statement
  • Please give us your opinion
  • American President George W Bush in his last Press conference advises newly elected President that he Did many mistakes in his tenure and Barack Obama should do what he think is right
  • I as the President and according with the constitution of USA and law i did the developments, i put forward American interest on my own popularity.
  • If the Scottish National Party is paying attention to these developments, it should be encouraged, I venture to suggest, by what it may discover in so doing.

  • No one is going to say anything about the way that this historical figure is being falsely accused.
  • Some things are ignored because nothing can be done about them.
  • Sir/Madam, maybe if you read not only the history of Pakistan, but of it's neighboring four countries, or of seven other countries, then you might understand the meaning of Mr. Psycho.
  • One whose nature is wrong cannot be taught.
  • It's all poppycock. 

Wednesday, October 17, 2012

Another Attempt at Controls

Per our previous discussion, I began working on pulling control sentences from an external source, instead of using the professional translations. Since our data is English news data, and since New York Times has a nice open API, I decided to try to pull sentences from NYT and see if we could construct decent matches. 

My plan was:
- Take the four Turker translations we are trying to match
- Calculate the top k words from those translations by tf*idf score
- Query NYT using those k words as search terms
- Pull the returned articles and break them into sentences
- Use tf*idf overlap scoring we'd used on the wikipedia controls to find the best sentence
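
The steps above can be sketched as follows. This is only an illustration: the tokenizer, the smoothed idf formula, and the overlap score (summing the translations' tf*idf weights over the words a candidate shares) are assumptions, not necessarily the exact scoring used on the Wikipedia controls, and the NYT query step is left out.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z0-9']+", text.lower())


def build_idf(documents):
    """Smoothed idf from a background collection of documents."""
    n = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(tokenize(doc)))
    default = math.log(1 + n) + 1.0  # weight for words never seen in the background
    idf = {w: math.log((1 + n) / (1 + c)) + 1.0 for w, c in df.items()}
    return idf, default


def tfidf(text, idf, default):
    counts = Counter(tokenize(text))
    return {w: c * idf.get(w, default) for w, c in counts.items()}


def top_k_words(translations, idf, default, k=5):
    """Top k words of the pooled Turker translations by tf*idf."""
    weights = tfidf(" ".join(translations), idf, default)
    ranked = sorted(weights, key=lambda w: (-weights[w], w))
    return ranked[:k]


def overlap_score(candidate, target_weights):
    """Sum of target tf*idf weights over the distinct words the candidate shares."""
    return sum(target_weights.get(w, 0.0) for w in set(tokenize(candidate)))


def best_sentence(candidates, translations, idf, default):
    """Pick the candidate sentence that overlaps most with the translations."""
    target = tfidf(" ".join(translations), idf, default)
    return max(candidates, key=lambda s: overlap_score(s, target))
```

Here `top_k_words` supplies the search terms and `best_sentence` picks the control from the sentences of whatever articles come back.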

Reasonable as it seemed, this plan didn't produce great results. First, very few articles were returned from my queries; most searches returned nothing. I'd originally set k to 5, but this was apparently too restrictive for the search. I was able to get results if I dropped k to 1, but, needless to say, that significantly hurt the matchiness of the sentences returned (see below). Second, the query searches the entire text of the articles but does not return the full text. So the sentences that are actually returned as candidate control sentences are only those from the first paragraph of the matching article. I think it would be worth looking into Google's API instead. I expect it will give better results for higher values of k, and would return full texts of articles.

Here are some example control sentences returned. These are with k = 2, so probably not even worth posting except that they are so amusing.

My personal favorite:
  • But a group for animal rights, "Pita", demanded the lobster's release and now he has been released in the sea again.
  • George was caught in a sea in Canada, after which he spent ten days in the tank of "City Crab and Seafood Company."
  • The age of a lobster can usually be determined by its weight.
  • According to the restaurant manager, Keith Wallanty, they did not mean any harm to the lobster, and put him in the tank only to arouse curiosity amongst their customers.
  • Ten minutes later, I was in the land of carrots.

  • Isn't all this terrorism too? Why does only ones pain seem painful?
  • Aren't other people human too? So why is only Muslim blood so important? Wasn't all the innocent people's blood red too?
  • So why are there no disapproval messages now?
  • I watched the Mumbai attacks investigations like the fire was in my own home. I cannot watch this human suffering. Can't everyone else see this suffering? I wish Muslims all over could come together and think about this.
  • Hacipasa, Turkey - The men stood at the road's edge and watched the war that is inching ever closer to home.

This one is a classic New York Timesy sentence. So theatrical. They can't just open with a fact like Bloomberg or WSJ:
  • In which, the ability of the heart to pump blood, and it's ability to shrink and relax was researched.
  • Experts have concentrated mainly on the left side of the heart which is mainly responsible for the pumping of blood.
  • They discovered that with increase in age, the problems in the rate of heart relaxation also increase.
  • Dr. Patricia says that with increase in age, the problems in the heart relaxation rates increases and thus effects the ability to exercise with growing age.
  • A steady rain was soaking the windows of La Guardia airport when Nancy Thode, an elite frequent flier with Delta air lines, approached a gate agent with a pressing question: had her request for an upgrade cleared?

  • Dr. Carwin Cain was also part of this research.
  • He says that it's surprising to see the permanent effects of improper exercise on the heart.
  • In some instances, blood pressure is a factor in heart anomalies and heart diseases.
  • If doctors and patients can work together when it comes to heart diseases, then they can enjoy the benefits of exercise even with older age.
  • To buy doctors lunch or dinner, for example, but tempting them with lavish gifts was taboo.


Tuesday, October 16, 2012

New Data and New HITs Followup

I regenerated the editing HITs, this time using 5-sentence groups that contain 4 contiguous Turker translations (from the same Turker and the same source document) and one professional control (chosen from the same document as the Turker's translations, but a different sentence). Both versions are currently up on the sandbox site. I am going to try finding other controls besides the professional translations, and will hopefully have a version 3 up soon. (Version threes are always the best versions.)
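
A rough sketch of how such groups could be assembled. The data layout is an assumption made for the example: Turker sentences are keyed by a hypothetical `(turker_id, doc_id)` pair and listed in document order, and the professional sentences for each document use the same sentence indices.

```python
import random


def build_groups(turker_sents, professional, group_size=4, seed=0):
    """Build groups of contiguous Turker translations plus one control.

    turker_sents: {(turker_id, doc_id): [sentence, ...]} in document order.
    professional: {doc_id: [sentence, ...]} aligned to the same indices.
    """
    rng = random.Random(seed)
    groups = []
    for (turker_id, doc_id), sents in sorted(turker_sents.items()):
        for start in range(0, len(sents) - group_size + 1, group_size):
            used = range(start, start + group_size)
            # Control candidates: same document, but not one of the shown sentences.
            options = [i for i in range(len(professional[doc_id])) if i not in used]
            if not options:
                continue
            control_idx = rng.choice(options)
            groups.append({
                "turker": turker_id,
                "doc": doc_id,
                "translations": sents[start:start + group_size],
                "control": professional[doc_id][control_idx],
                "control_idx": control_idx,
            })
    return groups
```

Documents with fewer than five sentences produce no groups under this scheme, since there would be no different sentence left to use as the control.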

Examples of version 2:


Everything is possible in America
But by spite of this , I are hopeful that the dream of our founders will liveed .
According to management two million people are attending this gathering
Monday is a national holiday in America because on this day one of the greatest saviour of human rights, Martin Luther King was being assisinated
Martin Luther King was black in race

Britains Secretary of State said in front of media in Mumbai and also wrote that the idea of war against terrorism is wrong.Killing people for the sake of threats being faced had been the policy of President Bush
It is to be noted on newly electes American President Obama aslo expressed such veiws of Kashmir .
David Milliband said that Pakistan should not have any soft corner for the extremist groups
Is India creating uncertainity in the region to avoid the resolution of Kashmir issue by big powers?
Are the Mumbai attacks the part of the plan?

Pubslishing something from their own sounds far, they even din't publish the discussions.
And my question to you is do the teachings of Islam , or those of the East or West give any logic , philosophy , or reasoning that approves of the killing in innocent citizens , including people who are praying , laborers , and low-income workers , and employees , by an consequence of American attacks ?
On be half of the muslims
I have got a single question from all of the human right champions.
It is impossible for them to stand infront of you with neclaces if you kill their childeren.

Monday, October 15, 2012

New Data and New HITs

Okay...not new HITs. The same old, recycled HITs. But new data, and that is what really matters.

I reposted the postediting HIT on the sandbox site using Omar's data set, which contains four professional translations in addition to four Turker translations for each sentence. The HITs are set up now to show 5 versions of the same sentence, four of which are Turker translations and one of which is a professional translation (randomly chosen) which has errors added. I think this grouping is nice, since the professional sentence tends to blend nicely and it gives the Turkers a lot of context, which makes editing easier.

I am also going to try a grouping that gives four contiguous, different sentences from the document, one of which is an errorified professional translation. My instinct is that this will not allow the control to blend as well, but I will try both and see what the consensus is on which one is better ('consensus' being an unbiased poll of an iid draw from the population, i.e. the majority opinion of myself, my boyfriend, Chris, and Matt).

Here are some example groupings under the current setup (controls in red). These and the 97 other HITs are on the sandbox site, and can be found if you search for 'Ellie'.

He said that both countries have extremely different history.
He said that the histories of both countries were extremely different.
He said that both countries have entirely different history.
He saids for the two countries have different historical backgrounds . 
He said that the history of both the countries is different.

Brothers, the end result of killing innocent children is destruction.
brothers, the only end to the killer of innocent children is destruction.
Brothers, the consequence of a murderer of children is disaster.
Brothers, the fate of the murderer of innocent children is annihilation.
Brohters , an fate in the killers of innocent children was destruction . 

Seventeenth Ammendment cannot be abolished till Zardari is the President.
as long as Zardari is the president, seventeenth amendment is not going to be cancelled.
17th amendment will not be canceled until Zardari is the President.
As long as the Zardari is preisdent , a seventeenth amendment is not with to be annulling . 
As long Zardari is President 17th amendment will be there.

Tuesday, October 2, 2012

Next Steps...

After a little break as the semester has been starting up, I am returning to my work on this project. My summer work ended up being a lot of infrastructure-building, more than I had originally foreseen. Chris and I decided the best thing to do at this point, before going forward, is to step back and try to decide exactly what questions we want to answer with all of this infrastructure, and which experiments we need to run in order to answer them.

At the highest level, the point of all of this work is to use MTurk to get cheaper translation data without taking a hit on quality. (Pun intended.) From this standpoint, the two broad questions we want to answer are:
  1. Does post-editing translations improve translation quality?
  2. Can we automate the quality control process, and save the cost of paying for poor work/QC HITs?
To answer these questions, we need a good measure of translation quality that is intuitive. Following Chris and Omar's paper, it seems logical to use human judgments: we ask Turkers to rank translations, and assume that the most highly ranked translations are better. We should also have a measure that is automatic and unambiguous, which could be used in automated processes and doesn't rely on Turkers for ranking: TER is a good, easy-to-compute, and widely used metric for this. It would also be good to have a gold standard, so we can have a sense of 'how much better' one translation is than another. Again, piggybacking off of Omar, we can use his data set, which consists of both Turker translations and professional translations of the same sentences.

Then our question (1) can be made more concrete:
  • Does post-editing translations improve translation quality?
    1. When ranked by Turkers, are post-edited translations ranked higher than their pre-edited versions?
    2. When calculated against the professional translation, is the TER for post-edited translations lower than for their pre-edited versions?
Question (2) is a little more complicated. TER is a good, automatic metric, but we obviously don't have professional translations against which to compare all our Turker translations during QC. We instead can use an embedded control sentence for which we have a gold standard, and hope that the TER score of a Turker's version of the control against the gold standard is a good indicator of the quality of the rest of the Turker's work. We can test this by comparing the TER scores for a Turker's control with the rankings, as judged by other Turkers. 
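
For concreteness, here is a simplified stand-in for scoring a Turker's version of the control against the gold standard. It is not real TER: true TER also counts block shifts as single edits, which this sketch ignores, so it is just word-level edit distance divided by reference length. The lowercasing and whitespace tokenization are also assumptions.

```python
def word_edit_distance(hyp, ref):
    """Minimum number of word insertions, deletions, and substitutions."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            cost = 0 if h == r else 1
            curr.append(min(prev[j] + 1,          # delete a hypothesis word
                            curr[j - 1] + 1,      # insert a reference word
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def approx_ter(hypothesis, reference):
    """Edits per reference word; 0.0 means the two sentences are identical."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not ref:
        raise ValueError("reference must not be empty")
    return word_edit_distance(hyp, ref) / len(ref)
```

For the actual experiments an established TER implementation would be the safer choice; this version is only meant to make the lower-is-better direction of the score explicit.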

Then our question (2) can be made more concrete:
  • Can we automate the quality control process, and save the cost of paying for poor work/QC HITs?
    1. Do Turkers who get low TER scores for their control sentences tend to have their non-control sentences ranked higher? 
    2. Do Turkers who get low TER scores for their control sentences tend to have lower TER scores for their non-control sentences, when calculated against the professional translations?
Luckily, these questions mostly reuse different views of the same data. I think all of the data can be collected in two HITs:

HIT 1: The ESL HIT we already have (the one that has been the star of this blog for the past three months). This HIT will need to be populated with data for which we have professional, gold standard sentences (i.e. Omar's data). The HIT should then have 5 sentences, 4 of which are Turker translations taken from Omar's data set, and one of which is an artificially created control sentence (more questions about this are below). Omar's data has four translations for each sentence/professional translation. This HIT will produce a data set which has four post-edited versions for each of those four translations.

HIT 2: Workers are presented with 6 versions of a sentence: one professional version, four Turker post-edited Turker translated versions, and one pre-edited Turker translated version. They will be asked to rank the sentences in order from best to worst. We can expect (and can use as a form of QC) that the professional translation is ranked best, with the post edits in the middle, and hopefully the pre-edited version last. This HIT will produce a data set which has, for a given translation in Omar's data set, a relative ranking of the four Turkers who post-edited that translation. 
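
The ranking-based QC described for HIT 2 could be checked with something like the sketch below. The label strings are hypothetical, and whether to also require the pre-edited version to come last is left as an option, since that is a hope rather than a firm expectation.

```python
def passes_ranking_qc(ranking, require_pre_edit_last=False):
    """Check one worker's ranking of the six versions.

    ranking: version labels ordered best to worst, for example
    ["professional", "post_edit", "post_edit", "post_edit", "post_edit", "pre_edit"].
    """
    if not ranking or ranking[0] != "professional":
        return False
    if require_pre_edit_last and ranking[-1] != "pre_edit":
        return False
    return True
```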

Using the output of these two HITs, we can (hopefully) show that post-edited sentences are better than their pre-edited versions, in terms of human judgments and in terms of TER score against the professional version. We can also, for all HITs, plot the TER score on control sentence against average human ranking across the other sentences, and hopefully show that average rankings increase as TER scores on controls decrease. This would suggest that using TER on a control sentence is a good automatic filter for the quality of the worker overall. 
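
Besides plotting, the relationship could be summarized with a rank correlation. The sketch below computes Spearman's correlation in plain Python (choosing Spearman is my assumption; any rank correlation would do). It assumes one pair per worker or HIT: the TER on the control, and the average rank position on the other sentences, where 1 means ranked best. Under that convention, a strong positive correlation is the hoped-for result: low TER on the control goes with good (low) rank positions.

```python
import math


def average_ranks(values):
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (sx * sy)


def spearman(control_ters, avg_rank_positions):
    """Spearman correlation: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(control_ters), average_ranks(avg_rank_positions))
```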

One implementation question I still have is how to choose the control sentences to embed into the HIT. Whereas the data set I was using this summer came with Wikipedia documents attached, and we could pull control sentences this way, Omar's data set does not have this. One idea is to use the gold-standard professional translation with introduced errors as the control; I would worry that this could affect the TERs of the translations against the professional that we want to calculate later, since the Turkers will have, in a sense, seen the professional version as they were editing the nonprofessional versions. I don't know that this would cause a problem, but it is a thought. Another option is to randomize the sentences, which removes the need to 'match context' between the controls and the rest of the HIT. The problem is that this might increase the difficulty of the task. I would be interested in Chris's or Matt's opinions on this. Or my Mom's...since I know she reads this too. All suggestions welcome. :-)