Tagging with disambig

Thu May 13 16:59:48 PDT 2004

In message <005a01c438ff$3efbc5c0$34284484 at cs.technion.ac.il> you wrote:
> Hi,
> 
> I use disambig for POS tagging.
> 
> I have two questions:
> 1. Is there a utility that automatically generates the map file required
> for disambig from a tagged corpus?

It's very corpus-dependent, just like text conditioning for LM training,
so there are no "standard" tools.  It should require only a moderate
amount of perl or gawk hacking.
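
As a rough illustration of the kind of script meant here, the following
Python sketch collects the tags seen with each word and prints one map
line per word.  It assumes a corpus with one sentence per line and
tokens written as word/TAG; that layout, the separator, and the function
names are assumptions for the example, not anything disambig prescribes.
Sentence-boundary entries, if your setup needs them, are not handled.

```python
from collections import defaultdict

def build_map(tagged_lines, sep="/"):
    """Collect, for each word, the set of tags it occurs with."""
    tags_for = defaultdict(set)
    for line in tagged_lines:
        for token in line.split():
            # Split on the last separator so words containing "/" survive.
            word, _, tag = token.rpartition(sep)
            if word and tag:
                tags_for[word].add(tag)
    return tags_for

def format_map(tags_for):
    """One line per word: the word, a tab, then its possible tags."""
    return "\n".join(
        f"{word}\t{' '.join(sorted(tags))}"
        for word, tags in sorted(tags_for.items())
    )

# Toy corpus, invented for the example.
corpus = ["the/DT can/NN rusts/VBZ", "you/PRP can/MD go/VB"]
print(format_map(build_map(corpus)))
```

Probabilities are left out, so every listed tag is treated as equally
possible for its word.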

> 2. Suppose I want to assume (for a 'didactic' purpose) that Ti (the i'th
> tag) depends not only on Ti-1 but also on Wi-1. Is there an easy way to
> encode this assumption into the lm file?

Depends on what you consider "easy" ;-).

You can do it by including the words in the states of the HMM.
So the "hidden" vocabulary would consist of pairs (Wi,Ti), and 
the observed vocabulary is still the words Wi.  The map file
would enforce consistency between the two.  In other words, the
map file just lists the possible correspondences:

w	w,t1 w,t2 w,t3 ...

(the probabilities can be omitted and default to 1).
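
A minimal sketch of generating such lines, assuming the training data is
already available as lists of (word, tag) tuples and that a comma is a
safe joiner (it is not if words or tags can themselves contain commas).
The function name and the input layout are made up for the example.

```python
from collections import defaultdict

def pair_map_lines(tagged_sentences, joiner=","):
    """Map each observed word to the combined word,tag tokens seen for it."""
    tags_for = defaultdict(set)
    for sentence in tagged_sentences:
        for word, tag in sentence:
            tags_for[word].add(tag)
    return [
        word + "\t" + " ".join(f"{word}{joiner}{tag}" for tag in sorted(tags))
        for word, tags in sorted(tags_for.items())
    ]

# Toy data, invented for the example.
sentences = [
    [("the", "DT"), ("can", "NN"), ("rusts", "VBZ")],
    [("you", "PRP"), ("can", "MD"), ("go", "VB")],
]
for line in pair_map_lines(sentences):
    print(line)
```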

If you do this and nothing else you would need an N-gram LM over the 
combined (Wi,Ti) sequence.  But you say you want a more specific model
of the form

	P(Ti | Wi-1, Ti-1)

This, too, can be done but requires some work.
You construct a trigram count file of 3-grams (Wi-1, Ti-1, Ti)
from your training data, and estimate an LM for it (be sure to specify all the
words as non-events so they don't receive any probability).
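
The counting step might look like the sketch below.  It assumes the same
list-of-(word, tag)-tuples input as before and writes each count as the
three tokens followed by a tab and the count.  It only counts
transitions inside a sentence; how to treat the sentence start is left
open here and is something you would have to decide for your own data.

```python
from collections import Counter

def context_tag_counts(tagged_sentences):
    """Count (Wi-1, Ti-1, Ti) triples over adjacent positions."""
    counts = Counter()
    for sentence in tagged_sentences:
        for (prev_word, prev_tag), (_, tag) in zip(sentence, sentence[1:]):
            counts[(prev_word, prev_tag, tag)] += 1
    return counts

def count_file_lines(counts):
    """Format the counts as 'w t_prev t<TAB>count' lines."""
    return [
        f"{w} {t_prev} {t}\t{n}"
        for (w, t_prev, t), n in sorted(counts.items())
    ]

# Toy data, invented for the example.
sentences = [
    [("the", "DT"), ("can", "NN"), ("rusts", "VBZ")],
    [("the", "DT"), ("can", "NN")],
]
counts = context_tag_counts(sentences)
for line in count_file_lines(counts):
    print(line)

# The distinct words, i.e. the tokens to declare as non-events.
words = sorted({w for (w, _, _) in counts})
```

Note that this only works cleanly if no word has the same spelling as a
tag; otherwise the two would have to be kept apart, for example with a
prefix.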

Then you construct a bigram LM in terms of the (W,T) tokens, such that it
gives exactly the same probabilities as the more constrained model
you just estimated.  So you have to construct a bigram LM file 
and make sure that the bigram

	Wi-1,Ti-1   Wi,Ti

gets the probability  P(Ti | Wi-1, Ti-1) * P(Wi | Ti),
for all Wi-1, Ti-1, Wi, Ti.
You have to write your own program to construct this file 
in ARPA LM format, but it's not rocket science once you understand
the format.
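
To make the core of that program concrete, here is a sketch that
computes only the 2-gram entries: the log10 of the product above, a tab,
and the two combined tokens.  The two probability tables are assumed to
be plain dictionaries you have filled in yourself (the tag model from
the LM estimated above, the word-given-tag model from your tagged data);
their names and layout are hypothetical.  The file header, the unigram
section, backoff weights and sentence-boundary entries are deliberately
left out, so this is not a complete LM file.

```python
import math

def bigram_entries(pairs, p_tag, p_word):
    """Build 2-gram lines for the combined (word, tag) tokens.

    pairs  : list of (word, tag) hidden tokens
    p_tag  : dict, p_tag[(w_prev, t_prev, t)] = P(t | w_prev, t_prev)
    p_word : dict, p_word[(w, t)] = P(w | t)
    """
    lines = []
    for w_prev, t_prev in pairs:
        for w, t in pairs:
            p = p_tag.get((w_prev, t_prev, t), 0.0) * p_word.get((w, t), 0.0)
            if p > 0.0:  # skip combinations with zero probability
                lines.append(f"{math.log10(p):.6f}\t{w_prev},{t_prev} {w},{t}")
    return lines

# Toy numbers, invented for the example.
pairs = [("the", "DT"), ("can", "NN")]
p_tag = {("the", "DT", "NN"): 0.5}
p_word = {("can", "NN"): 0.1}
for line in bigram_entries(pairs, p_tag, p_word):
    print(line)
```

The double loop is quadratic in the number of (word, tag) tokens, which
is fine for a didactic experiment but can get large for a real
vocabulary.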

Then you decode using this LM and disambig.

--Andreas