Summer 2012 Research Journal: August 2012

Wednesday, August 29, 2012

Finalizing ESL HIT

I am working out some final kinks in the ESL HIT with the intention of posting a big trial of about 1000 HITs tomorrow (the control sentences are compiling right now...slowly, I might add...). The summer is winding up, and it's about time to stabilize this HIT and let it start running and collecting data on its own in the real world (...I can't hold its hand forever...). The goal is to have a large number of HITs posted before classes start so that it can be left alone for a while.

1. The main change I've been making is fixing up the control-grading scripts. After my last HIT's snafu* (I accidentally rejected a handful of legitimate workers. I felt like an evil robber baron, selfishly denying my workers pay, and I was ready for the revolution to begin at my doorstep) I did some serious reworking of my grading algorithm. Before, I'd been doing a simple comparison in which I just compared two words at the same index:

So if the control sentence was: I went to the store.

and we dropped "the" to make it: I went to -- store.

and the Turker added "the" back in: I went to the store.

I had been simply checking that the fourth word in the Turker's sentence was the same fourth word that was deleted from the original sentence, and giving them points if it matched. This was a far too basic way of grading. I was assuming that Turkers would make almost no extra edits, and that they would edit in the same order in which I had generated errors. In short, it was a very weak algorithm.

I switched to using the data structure I've described in an earlier post, so that I now trace each initial word to its final word. This requires a lot more code and bookkeeping, but is much more robust to varying orders of edits and extraneous changes in the sentence.
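
To make the difference concrete, here is a minimal sketch of the tracing idea. It is not the grading script itself: it assumes the word-to-word trace can be approximated by aligning the two word lists with Python's difflib, and the function names (trace_words, restored_deleted_word) are made up for illustration.

```python
from difflib import SequenceMatcher


def trace_words(shown, submitted):
    """Trace each word of the sentence shown to the Turker to what it became.

    Returns (trace, insertions):
      trace[i]    -> words in `submitted` that word i of `shown` turned into
                     ([] if the word was deleted)
      insertions  -> list of (i, words): `words` were added before shown[i]
    """
    trace = {}
    insertions = []
    matcher = SequenceMatcher(a=shown, b=submitted, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            for k in range(i2 - i1):
                trace[i1 + k] = [submitted[j1 + k]]
        elif tag == "delete":
            for i in range(i1, i2):
                trace[i] = []
        elif tag == "replace":
            # Attribute the whole replacement span to every word it replaced.
            for i in range(i1, i2):
                trace[i] = submitted[j1:j2]
        else:  # "insert"
            insertions.append((i1, submitted[j1:j2]))
    return trace, insertions


def restored_deleted_word(shown, submitted, position, word):
    """True if `word`, which we deleted before shown[position], was added back there."""
    _, insertions = trace_words(shown, submitted)
    return any(i == position and word in words for i, words in insertions)


shown = "I went to store.".split()
# The Turker restores "the" but also makes an extra edit at the front.
submitted = "Yesterday I went to the store.".split()
print(restored_deleted_word(shown, submitted, 3, "the"))
```

With the old same-index comparison, the extra word at the front shifts everything over by one and the Turker would be marked wrong; the alignment still finds "the" in the gap where it was deleted.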

2. Customized feedback for workers: Rather than just saying "Sorry," we decided to give Turkers helpful feedback when their results don't pass our QC. Now our response includes the control sentence, and the answer we were expecting to receive from them.

3. Reposting HITs once if they are rejected, so that we can be sure we are getting four high-quality responses for each HIT.

And, as always, sundry small changes throughout. With luck, the HIT will be up within the next 12 hours.

*This led to an adventure in the Java MTurk APIs. I will describe the process and post the code tomorrow in a highly-anticipated post: "Rejection and Regret: How to Repay Rejected Turkers and Rekindle Long Lost Data-Acquisition Relationships." Stay tuned...

Thursday, August 23, 2012

Worker Performance against Number of HITs Submitted

Following Chris's suggestion, here is a plot of Turkers' average accuracy against the number of HITs they submit. Unfortunately, our sample is very skewed - most Turkers only submitted one or two HITs, and a handful of Turkers took care of the rest of the HITs.

Wednesday, August 22, 2012

Control-Only HIT : First Results

Results are in for our performance-only HIT, and I have some preliminary analysis. I haven't gotten a chance to run statistics on agreement or break down the results by POS, but I have some good descriptions of how Turkers are performing in terms of basic accuracy measurements.

Surprisingly, Turkers seem to make slightly more corrections than necessary. On average, Turkers make 4.59 changes per sentence, whereas there are only 4.52 errors per sentence.

This makes sense, since many Turkers end up making small word-choice corrections which are not actual errors, or misunderstanding the meaning of the sentence and having to make extra changes to compensate. For example, below, the punctuation change was a matter of taste, not necessity. (Ignore the random blue nodes...I have been lazy about fixing up my figure generating script...)

Turkers' accuracy overall is hovering somewhere around 40% (they correct about 40% of the errors that are actually present). I tried calculating the accuracy in two ways:

Identifying the right type of error in the right location counts as full credit (blue). I.e. if we introduced an error by changing a preposition, and the Turker changed the same preposition, they get full credit.
Identifying the right error in the right location gives half credit; correcting it back to the original version gives the other half credit (green). I.e. if we introduced an error by changing "of" to "in" and the Turker changes "in" to "on", they get half credit; if they change "in" to "of", they get full credit.
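
The two calculations can be sketched as follows. This is an illustration rather than the actual grading code: the dictionary fields (position, type, original_word, new_word) are hypothetical, and it assumes each edit has already been labeled with the same error types used when the errors were generated.

```python
def lenient_score(error, edit):
    """Full credit for the right type of edit in the right location."""
    if edit is None:
        return 0.0
    same_place = edit["position"] == error["position"]
    same_type = edit["type"] == error["type"]
    return 1.0 if same_place and same_type else 0.0


def strict_score(error, edit):
    """Half credit for finding the error, the other half for restoring the original word."""
    if not lenient_score(error, edit):
        return 0.0
    restored = edit["new_word"] == error["original_word"]
    return 1.0 if restored else 0.5


def accuracy(errors, edits_by_position, score):
    """Average score over all introduced errors; errors nobody touched score zero."""
    if not errors:
        return 0.0
    total = sum(score(e, edits_by_position.get(e["position"])) for e in errors)
    return total / len(errors)


# We changed "of" to "in" at word 2 and deleted "the" at word 5.
errors = [
    {"position": 2, "type": "preposition", "original_word": "of"},
    {"position": 5, "type": "determiner", "original_word": "the"},
]
# The Turker changed "in" to "on" and missed the determiner.
edits = {2: {"position": 2, "type": "preposition", "new_word": "on"}}
print(accuracy(errors, edits, lenient_score), accuracy(errors, edits, strict_score))
```

Since the strict score can never exceed the lenient one for the same edit, the strict average is always the lower (or equal) of the two.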

Intuitively, the average for the latter is slightly lower. Encouragingly, the distributions are very close.

I liked the stricter measurement of accuracy better, so here it is isolated. (Blue in the lower figure = green in the upper figure.)

Some errors are easier to correct than others. Accuracy for insertion errors was the highest (errors requiring Turkers to remove unnecessary articles) and for deletion errors was the lowest (requiring Turkers to add in missing articles). Changing showed the most discrepancy between the different accuracy calculations (usually because changes were introduced with prepositions and picking the correct preposition leaves room for interpretation).

Changes had the smoothest distributions - again, likely because of the high variability in choosing the new word to which to change the old word. Deletions and insertions were more bimodal, with Turkers finding either all or none of the errors.

Monday, August 20, 2012

Creating, Managing, and Controlling External HITs

It has been a little while since my last post, mostly because I've been caught up in generating and grading the controls for our HIT. The up side is that I've had steady work, lots of code to write, and some quality bonding time with NLTK. The down side is that I now have to try not to blitzkrieg you with a discussion of 10 days' worth of code at a painful level of detail. As a compromise, I want to lay out a broad overview of my contribution to Dmitry's git repo, and explain how I've extended the existing code to support our ESL HIT. I suppose this can also be my tribute to Dmitry and a send-off of sorts as he goes on to bigger and better things. I will try not to become too sentimental...

The general pipeline for creating and running external HITs off of our server (also laid out in the README on git) is as follows:

load data into databases
divide the data into batches (e.g. 5 sentences per batch for the ESL HIT) and generate one HIT per batch
add the HITs to Mturk
wait for Turkers to do HITs, hit refresh constantly in browser and hope for more results
retrieve* all HIT data from Mturk and load into a buffer in database
read data from buffer into relevant database tables
grade controls for each assignment
approve/reject based on Turkers' performance

This is a general framework, and describes all the HITs currently in the repo (including the ESL hit, a hit for translating Spanish tweets, and a hit for determining whether two translations have similar meanings). I will discuss the above pipeline in the specific context of the ESL HIT, and try to point out which aspects are unique to ESL and which are common between all HITs.

1. Loading Data into Database (scripts: esl.sql, load_data_to_db.py)

There are nine tables in the ESL database:

esl_sentences

hits

esl_assignments

esl_hits_data

esl_hits_results

esl_location

esl_workers

esl_controls

esl_edits

The first (esl_sentences) simply stores all the sentences we are using to generate our HITs; all the hits have an analogous table containing the necessary raw data. The next six tables (hits, assignments, hits_data, hits_results, location, and workers) are also tables which appear analogously in all HITs. The hits table is the most central, and contains Mturk's HIT ids and internal database HIT ids for all the hits. This one is used to link each HIT to the data it contains, the assignments it spurred, the workers assigned to it, etc (see figure below). The final two (controls and edits) store data specific to the ESL HIT; controls holds information about the automatically-generated errors added to our control sentences and edits holds information about each atomic edit made by the Turkers.

The esl_edits table is one of the places that I have broken rank with the other HITs in the repo. While I have an esl_hits_results table, I have not actually been using it, but rather have been storing my results in esl_edits. I thought it was easier and neater to keep the data this way than to add columns to the hits_results table. I have also been ignoring the esl_location table so far (which will hold data taken from cookies and surveys about educational/language background), but I will probably start using it soon.

All the tables should be built initially by running the esl.sql script (although my script is not up to date, since I've been adding and removing columns from the command line). The load_data_to_db.py script simply populates the esl_sentences table by reading from a .csv file; all other tables are populated later in the pipeline. (Reading the .csv file and populating the table is one area that has to be tweaked for each different HIT, since all have different input data.)

2. Generating HITs (script: generate_esl_hits.py, generate_cntrlonly_hits.py, controls.py, generrors.py)

HITs are generated by reading data out of the esl_sentences table, batching the data into bite-sized HITs, and creating entries in both the hits and the esl_hits_data tables to reflect the existence of that HIT. The HIT does not have an Mturk HIT id when it is created (it is not assigned until it has been added to Mturk). In other words, after running either generate_hits.py script, the tables will show one row corresponding to each future HIT, but will have a blank mturk_id entry.

The ESL generate scripts are similar to the other HITs' scripts except that they contain extra code for inserting controls into the HITs. Our ESL controls are created using English sentences that depend on the non-control sentences in the HIT, so it is necessary to create the controls dynamically as the HITs are generated. The generate script takes all the sentences from the esl_sentences table, sorts them by document, splits them into batches of four, and then runs the four sentences through a separate control pipeline (subject of a future post). This control pipeline creates a control sentence, enters that control sentence into the esl_sentences table, and returns the id of the new esl_sentences entry. The generate script creates a new HIT which contains the four original sentences plus the new control.
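
As a rough illustration of the batching step (not the code in generate_esl_hits.py), here is a sketch that works on plain dictionaries instead of database rows. The field names (id, doc_id, sentence_ids) and the make_control callback, which stands in for the control pipeline, are assumptions made for the example.

```python
def batches_of(sentences, size=4):
    """Sort sentences by document, then cut the sorted list into batches of `size`."""
    ordered = sorted(sentences, key=lambda s: (s["doc_id"], s["id"]))
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]


def generate_hits(sentences, make_control):
    """Build one HIT record per batch: the batch's sentences plus one control."""
    hits = []
    for batch in batches_of(sentences):
        control_id = make_control(batch)  # stand-in for the control pipeline
        hits.append({
            "mturk_hit_id": None,  # stays blank until the HIT is added to MTurk
            "sentence_ids": [s["id"] for s in batch] + [control_id],
        })
    return hits


# Toy data: 8 sentences from two documents, and a fake control pipeline that
# just hands back the id of a "new" esl_sentences row.
sentences = [{"id": i, "doc_id": i // 5} for i in range(8)]
new_ids = iter(range(100, 200))
for hit in generate_hits(sentences, lambda batch: next(new_ids)):
    print(hit)
```

Because the batches are cut from one long sorted list, a batch can straddle the boundary between two documents, as the second batch does in the toy data.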

3. Adding the HITs to Mturk (script: add_esl_hits_to_mturk.py)

I left this script untouched, literally copied and pasted from the other HITs. It uses Dmitry's wrapper around the boto API to add the HITs to Mturk. It retrieves the newly assigned Mturk HIT id and fills in the empty column of the hits table in the database.

4. Waiting (or busy waiting, depending on how often you hit refresh...) (scripts: main.py, esl.tpl, ESLHIT.js)

As Turkers accept HITs, they are populated with the correct sentences (those that were determined during the generate script) using some wonderfully modular python templating libraries. The code in main.py is another area that has to be changed in order to add a new HIT. Each HIT has a .tpl template file, which is largely in html and javascript, but contains fragments of python code which can dynamically populate data based on parameters that are passed to the template.

For example, the ESL HIT populates its 5 sentences dynamically from the esl_sentences table. The sentences are passed to the template in main.py (see line 316):

@route('/hits/esl/<language>')
@view('esl')
def esl_hit(language):
    # when the page is rendered, get assignmentId/hitId and attach them to the displayed results
    assignmentid = request.query.assignmentId
    hitid = request.query.hitId
    ...
    # "hitid" at the end of the query stands for the value read from the request above
    sql = "select es.* from esl_hits_data ehd, hits h, esl_sentences es where es.id=ehd.esl_sentence_id and h.id=ehd.hit_id and h.mturk_hit_id=hitid"
    ...
    # sentences = list of the rows returned by the SQL query
    params = {
        "hit_type": "vocabulary-ru",
        "assignmentid": assignmentid,
        "hitid": hitid,
        "sentences": sentences,
    }
    return dict(params=params)





Then the sentences are loaded in esl.tpl from params (see line 274):




var sentences = [ 
 %for sentence in params['sentences']:
  "{{sentence["sentence"]}}",
 %end 
];

New HITs being added would need to have a similar .tpl file and @route() defined in main.py.

5 and 6. Retrieving and Storing the Data from MTurk (scripts: multi_test.py, buffer_update.py)

Pulling results from MTurk is fairly straightforward. The multi_test.py script uses the same boto library mentioned before to call MTurk and retrieve all the HIT results (including all the fields from your javascript forms) as a giant JSON object and stores them in a buffer table in the database. I left this script largely unchanged. My only alteration was that I decided to enter worker information into the esl_workers table at this point, largely because it simplified the processing I had to do in buffer_update if I could be guaranteed that all workers would have an existing entry in the table.

The buffer_update.py script reads the data out of the buffer and populates the relevant results tables. This script I edited to fit the specific way I wanted my data stored.

7 and 8. Grading Controls and Paying Workers (scripts: buffer_update.py, qc.py)

Once all the data has been entered into the DB and the results are parsed and stored, we are able to review the results, pay workers for good work, and reject work that is not up to par. This post, despite my best intentions, is becoming very long, so I will describe the details of my grading in a follow-up post. I will instead focus on the difference between the ESL QC methodology and the Spanish tweets QC.

I was able to include the ESL QC grading in buffer_update.py since our controls are automatically gradable. This is a notable simplification from the tweet translation HIT, which required a separate HIT for QC, in which new Turkers viewed and gave thumbs up/thumbs down on each translation. Right now, the ESL HIT results are the end result, which makes the QC processing easier. We have discussed piping the results of this ESL HIT into a new error annotation HIT, and potentially feeding results back to the translator, which would make more use of the loops seen in the tweet translation QC.

This post was much longer than I intended; I didn't even get to discuss the control generation and grading itself, which was my intention. I'll cover these in a few sister posts...try to handle the suspense...

*P.S. Every single time I type "retrieve", I spell it "retreive." I had to stop using it in method names. And I am a native English speaker. ::sigh::

Friday, August 10, 2012

Generating Errors

I am working on programmatically inserting errors into our control sentences. Rather than using GenERRate, we decided to write our own script to do this. (I was going to fight to use GenERRate just because I think it's a cute tool and it seems pretty robust, but then Chris made an indisputable argument: GenERRate is in Java. Our scripts could be in Python. Q.E.D.)

We restricted the errors we are inserting to four relatively easy-to-automate ones:

Spelling - randomly switch two adjacent letters
Prepositions - randomly change a preposition to another preposition
Determiners - randomly add, delete, or change a determiner (only adding before nouns that do not already have determiners)
Verbs - randomly switch a verb ending/form (e.g. add 'ing' or delete 's')
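
Here is a stripped-down sketch of the four error types, to give a feel for how little machinery they need. It is not the generrors.py script: it assumes the sentence arrives already tagged with a simplified tag set (DET, NOUN, VERB, PREP), the word lists are small made-up samples, and the "does this noun already have a determiner" check only looks at the previous word.

```python
import random

PREPOSITIONS = ["of", "in", "on", "by", "from", "with", "to", "at"]
DETERMINERS = ["a", "an", "the", "this", "those"]


def swap_adjacent_letters(word, rng):
    """Spelling error: swap one random pair of adjacent, different letters."""
    spots = [i for i in range(len(word) - 1) if word[i] != word[i + 1]]
    if not spots:
        return word
    i = rng.choice(spots)
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]


def other_choice(word, options, rng):
    """Pick a random option that differs from the current word."""
    return rng.choice([o for o in options if o != word.lower()])


def change_verb_ending(word):
    """Verb error: delete a final 's' if there is one, otherwise add 'ing'."""
    return word[:-1] if word.endswith("s") and len(word) > 2 else word + "ing"


def introduce_errors(tagged, rng, rate=0.3):
    """Return the sentence with errors marked like <this>; <> marks a deletion."""
    out, prev_tag = [], None
    for word, tag in tagged:
        if rng.random() >= rate:
            out.append(word)  # leave this word alone
        elif tag == "PREP":
            out.append("<%s>" % other_choice(word, PREPOSITIONS, rng))
        elif tag == "DET":
            # Either delete the determiner or change it to a different one.
            out.append("<%s>" % rng.choice(["", other_choice(word, DETERMINERS, rng)]))
        elif tag == "VERB":
            out.append("<%s>" % change_verb_ending(word))
        elif tag == "NOUN" and prev_tag != "DET" and rng.random() < 0.5:
            # Add a determiner before a noun that does not directly follow one.
            out.extend(["<%s>" % rng.choice(DETERMINERS), word])
        else:
            misspelled = swap_adjacent_letters(word, rng)
            out.append("<%s>" % misspelled if misspelled != word else word)
        prev_tag = tag
    return " ".join(out)


tagged = [("The", "DET"), ("beach", "NOUN"), ("lines", "VERB"), ("the", "DET"),
          ("coastline", "NOUN"), ("of", "PREP"), ("Karachi", "NOUN")]
print(introduce_errors(tagged, random.Random(2012)))
```

Passing in a seeded random.Random makes a run repeatable, which is handy when the generated errors have to be stored and graded against later.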

A few examples of sentences with generated errors:

After the Soviet invasion of Manchuria and the atomic bombings of Hiroshima and Nagasaki in 1945, Japan agreed to an unconditional surrender on 15 August.

After the Soviet <invasino> of Manchuria and <an> atomic bombings of Hiroshima and Nagasaki <by> 1945, Japan agreed to an unconditional surrender on 15 August.

Hideyoshi invaded Korea twice, but following defeats by Korean and Ming Chinese forces and Hideyoshi's death, Japanese troops were withdrawn in 1598.

Hideyoshi invaded Korea twice, but following defeats <from> Korean and <the> Ming Chinese <forceed> and Hideyoshi's death, Japanese troops were withdrawn in 1598.

The village administration level is the most influential on a citizen's daily life, and handles matters of a village or neighborhood through an elected lurah or kepala desa (village chief).

The village administration level is the most influential on a citizen's daily life, and handles matters of <> <vlilage> or neighborhood <in> an elected lurah or kepala desa (village chief).

The Arabian Sea beach lines the southern coastline of Karachi.

The Arabian Sea <beahc> lines the southern coastline <from> <a> Karachi.

However, the Treaty of Peace with Japan was not signed until 1951, and that with Germany not until 1990.

However, <those> Treaty of Peace with Japan was not signed until 1951, and <of> with Germany not <untli> 1990.

ESL Controls and Matching Context

In response to Chris's comment, below are some examples of control sentences, and the context in which they'd appear in the HIT, pulled using tf*idf scores calculated using three different sets of sentences for reference:

only the 5 other sentences in the HIT
the 5 other sentences in the HIT, plus a 3 sentence window on each side
all other translations from that document

Intermediate Period: From reorganization till the proclamation of Republic era.

Era of Reformation: It is third era because it started after Sultanat-e-Usmanias end and that is why it is not our topic of discussion.

During the regime of Mohamed Fatheh lot of improvement had occurred in education and was himself a follower of learned people.

Mohammads follower spread education to the mass level and every Sultan used to build a mosque and with that it was mandatory to establish a school.

As a result the number of religious school were increased along with the mosques.

It ended when Mehmed I emerged as the sultan and restored Ottoman power, bringing an end to the Interregnum.

The millets were the major religious groups that were allowed to establish their own communities under Ottoman rule.

In the latter part of this period there were educational and technological reforms, including the establishment of higher education institutions such as the Istanbul Technical University.

Edo period lasted from the year 1603 to the year 1868.

Ayeasu is recognized as the most successful ruler in the history of Japan.

He won several wars through treason.

Although the Emperor always used to be the symbolic head of state in Japan the real power and jurisdiction remained at the disposal of Shogun or the head of military. But Ayeasu established a system of government based on the traditions of both the monarchy and feudalism.

Like Hideyoshi he also initially kept a soft spot for the Christians but the Portuguese and Spanish traders went only towards those places where the Catholic missionaries asked them to go.

Ieyasu was appointed shogun in 1603 and established the Tokugawa shogunate at Edo (modern Tokyo).

Japan has over 90,000 species of wildlife, including the brown bear, the Japanese macaque, the Japanese raccoon dog, and the Japanese giant salamander.

Later according to the Canada Act its name was kept as Canada and now this is the only name being used

A change that was reflected in the renaming of the national holiday from Dominion Day to Canada Day in 1982.

On 7th July 1969 according to the official language in the federal government french was given the status equal to English

From this Canadas journey of being a bilingual country started

English and French languages have equal importance in federal courts parliament and in all federal institutions.

English and French have equal status in federal courts, Parliament, and in all federal institutions.

Criminal law is solely a federal responsibility and is uniform throughout Canada.

The governmental occasion of PHP

The theme of first chapter is Jew and Christians criterion fulfilled and in place of them the foundation of Ismael (God Bless Him) as new people and their mentioning and their purification and filtration and the last pact with that God.

The second part talks about the Arab non believers and Allahs

The theme of third fourth fifth and sixth chapter is same which is the news of expression and purification and filtration.

The theme seventh and last chapter is to tell the rulers of Quraish about the day of Judgement and telling them the news of penalty and good news of Prophet Mohammed (PBUH) for the dominance of truth on the land of Arabs.

The number of verses differ from chapter to chapter.

As the Quran says, "With the truth we (God) have sent it down and with the truth it has come down.

Defiling or dismembering copies of the Quran is considered Quran desecration.

Much of the variation from using the whole document probably comes from the fact that some HITs contain sentences from two different documents, in which case the tf*idf is calculated using word frequencies over both documents.

Wednesday, August 8, 2012

August 8 - ESL HIT Quality Controls

Happily, the problems I mentioned at the end of my last post were quick fixes, and I have pulled together a list of controls in the way I'd described. I reworked the backend mturk scripts so that the HITs now include one control sentence mixed in with the normal ones. Unfortunately, I'm not too sure how well these controls will fool the Turkers. I am not sure what is off, but many of the controls cover obviously different topics than the translated sentences. The example below appropriately links the Urdu page on Japan to the English page, but the Urdu page contains much more limited content, and the tf*idf "best match" is not very convincing.

I optimistically flagged my control with the word "control" to make sure I could identify it among the other sentences. I feel like that was somewhat unnecessary...

I am going to comb over my tf*idf code to make sure there aren't any implementation bugs, and then fiddle with it a little more, and see if I can produce better results. Right now, it just gives points for rare words that positively overlap with the translation words; I might try punishing for rare words that do not appear anywhere in the translations...I really feel like "Pleistocene" is not a great word to use if you are trying to blend into a conversation about fighter jets...
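
To think through what the punishment might look like, here is a small sketch. It is only one possible reading of the idea: it assumes rarity is measured by idf over the sentences of the English page, that sentences are already split into lowercase words, and that the penalty is a simple weight I can tune. The function names are made up.

```python
import math
from collections import Counter


def idf_table(sentences):
    """Inverse document frequency, treating each tokenized sentence as one document."""
    n = len(sentences)
    df = Counter(w for s in sentences for w in set(s))
    return {w: math.log(n / df[w]) for w in df}


def penalized_score(candidate, translation_vocab, idf, penalty=1.0):
    """Reward rare words shared with the translations, punish rare words that are not."""
    reward = sum(idf.get(w, 0.0) for w in candidate if w in translation_vocab)
    cost = sum(idf.get(w, 0.0) for w in candidate if w not in translation_vocab)
    return (reward - penalty * cost) / max(len(candidate), 1)


english_page = [
    "the fighter jets were fast".split(),
    "the pleistocene era was cold".split(),
    "the jets were loud".split(),
]
translation_vocab = {"the", "fighter", "jets", "were"}
idf = idf_table(english_page)
for sentence in english_page:
    print(round(penalized_score(sentence, translation_vocab, idf), 3), " ".join(sentence))
```

Setting penalty to 0 gives back the reward-only behavior, so the two versions are easy to compare on the same HIT.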

Monday, August 6, 2012

ESL Quality Control

For starters: I'm officially a PhD student now!!!! I'm so excited! The status change didn't magically fix any of my bugs like I was hoping, but it still feels pretty good... :-)

It was exciting having data last week, but the quality of the edits we received was less than perfect. So this week, I am working on embedding some QC in our HIT, which will hopefully make round two's data much more interesting. We discussed a few quality control options:

Embedding sentences with known errors into the HIT and rejecting Turkers' submissions which fail to recover these errors
Enforcing a native-speaker requirement (somewhat shaky, since we will need to trust Turkers' answers as to whether they are native speakers)
Enforcing location requirement (making sure Turkers are not located overseas)
Following the methodology of a paper Chris suggested, pipelining the correction into multiple HITs to try to make use of redundancy. The authors of the paper developed a word processor with a crowdsourced essay editing service built in; they broke editing into three tasks, one to identify locations needing editing, one to edit those areas, and one to weed out bad edits. This could be an interesting approach, if our embedded control sentences do not get us the results we want. It is an interesting read, and very relevant since the concept behind their QC is very close to the pipeline we are using for translation quality.

I began working on the first step of the embedded controls. The first challenge with this form of control is picking sentences to use as controls which don't stand out from the other sentences in the HIT, which are all drawn from the same Wikipedia document. To do this, we discussed making use of the interlanguage links on Wikipedia, and pulling English sentences from a document that is as close to the original Urdu document as possible. This should be fairly straightforward thanks to wikipydia, an awesome Python library for querying Wikipedia. I am planning to generate controls for each HIT by:

Find the document id (Wikipedia page id) from which the original Urdu sentences were taken for the translating task (Matt saved this along with the translation data I used to generate the hits)
Query Wikipedia to pull the language links off of that page
Follow the English language link and pull the text of the English page
Calculate a tf*idf score for each sentence on the English page

This score will be computed using the term and doc frequencies taken from the translations corresponding to the Urdu document of the same topic. This way, sentences on the English page that have a vocabulary most similar to the translated Urdu sentences we are trying to match will get the highest scores.

Choose the English sentence with the highest tf*idf calculated as above, normalized to control for length
Manually enter some ESL-esque errors into the sentence
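
The scoring and selection steps above could look roughly like the sketch below. It fills in details the plan leaves open, so treat them as assumptions: each translated sentence counts as one document for the doc frequencies, the idf is a smoothed variant (so words found in every translation still count a little), tokenizing is a plain whitespace split, and the length normalization is a simple divide by word count.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:()\"'").lower() for w in text.split())
    return [w for w in words if w]


def tfidf_weights(translations):
    """tf*idf weight per word, with tf and df both counted over the translations."""
    docs = [tokenize(s) for s in translations]
    n = len(docs)
    tf = Counter(w for doc in docs for w in doc)
    df = Counter(w for doc in docs for w in set(doc))
    # Smoothed idf: never zero, even for words that appear in every sentence.
    return {w: tf[w] * (math.log((1 + n) / (1 + df[w])) + 1) for w in tf}


def score(candidate, weights):
    """Sum the weights of the candidate's words, normalized by its length."""
    words = tokenize(candidate)
    if not words:
        return 0.0
    return sum(weights.get(w, 0.0) for w in words) / len(words)


def best_control(english_sentences, translations):
    """Pick the English sentence whose vocabulary best matches the translations."""
    weights = tfidf_weights(translations)
    return max(english_sentences, key=lambda s: score(s, weights))


translations = [
    "Edo period lasted from the year 1603 to the year 1868.",
    "He won several wars through treason.",
]
english_page = [
    "Ieyasu was appointed shogun in 1603 and established the Tokugawa shogunate at Edo (modern Tokyo).",
    "Japan has over 90,000 species of wildlife, including the brown bear.",
]
print(best_control(english_page, translations))
```

Words that never appear in the translations contribute nothing here, so a long sentence with only one or two overlapping words is pulled down by the length normalization.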

I had written code to do much of this process- pulling from Wikipedia and calculating term and doc frequencies. Now I have run into two hurdles:

1) My code that was pulling the English sentences flawlessly yesterday using wikipydia is now crashing. I am hoping it is an internet connection error, since it seems to have come out of nowhere, but am trying to recover it so that I can continue.

2) In order to pick the English sentences that correspond to the correct HIT, I need to pull the document IDs for each HIT off of our DB. Unfortunately, the database seems to be down and Dmitry is in Siberia (his being in Siberia is not really relevant, but I like to mention it because I feel like it just adds drama to the situation). So I will need to ask around and see if we can get our DB up and running, which will hopefully be easily done.