a naive question need your help

Andreas Stolcke stolcke at speech.sri.com
Thu Aug 14 13:16:18 PDT 2008


jian zhu wrote:
> Hi Professor Stolcke:
>     I am a computer programmer from China. Thanks a lot for your great
> work on language modeling, and for unselfishly sharing the excellent slm
> toolkit!
>
>     I have a naive question and need your help.
>     I want to use the "disambig" tool for part-of-speech tagging, but I
> am having some trouble with it.
>     I use the tool as follows:
>     disambig -text file -map wtfile -lm ttfile
>
>     file     ---   word text
>     wtfile   ---   P(word|tag2) emission file
>     ttfile   ---   P(tag2|tag1) transition file
>
>     ttfile can be trained using the "ngram-count" tool, but I don't know
> how to produce wtfile using SRILM.
>
>     Its format is as follows:
>     -map file
>    Specifies the file containing the V1-to-V2 mapping information.
> Each line of file contains the mapping for a single word in V1:
> 	w1	w21 [p21] w22 [p22] ...
>
>      where w1 is a word from V1, which has possible mappings w21, w22,
> ... from V2. Optionally, each of these can be followed by a numeric
> string for the probability p21, which defaults to 1. The number is
> used as the conditional probability P(w1|w21), but the program does
> not depend on these numbers being properly normalized.
>
>     Thank you very much!
>      Looking forward to your help.
>   
There is no ready-made tool for estimating and formatting the map 
probabilities.  The format is simple enough that you should be able to 
write a Perl script or similar to estimate these probabilities from 
data.  Note that for taggers it is usually more convenient to construct 
the map file with probabilities p(w21 | w1) and use the -scale option.
To estimate p(POS | word) you can count occurrences in a tagged training 
corpus, possibly with some smoothing to allow for unseen combinations 
(for unseen words and open-class POS tags).  In the absence of 
training data you can try a uniform POS distribution.
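
As a rough illustration of the counting approach, here is a minimal sketch 
in Python rather than Perl.  It assumes the training corpus has one 
sentence per line with tokens written as word/TAG; that corpus format, the 
function names, and the file names are assumptions made for this example, 
not part of SRILM.  It writes plain relative frequencies p(POS | word) 
with no smoothing, so unseen words would still need separate handling.

```python
from collections import Counter, defaultdict


def count_word_tags(tagged_lines, sep="/"):
    """Count (word, tag) pairs in lines of word/TAG tokens (assumed format)."""
    counts = defaultdict(Counter)
    for line in tagged_lines:
        for token in line.split():
            # Split on the last separator so words containing "/" still work.
            word, found, tag = token.rpartition(sep)
            if not found or not word or not tag:
                continue  # skip tokens that are not of the form word/TAG
            counts[word][tag] += 1
    return counts


def map_lines(counts):
    """Yield one map line per word: word<TAB>tag1 p1 tag2 p2 ...

    Each p is the unsmoothed relative frequency p(tag | word).
    """
    for word in sorted(counts):
        total = sum(counts[word].values())
        fields = []
        for tag, n in sorted(counts[word].items()):
            fields += [tag, f"{n / total:.6g}"]
        yield word + "\t" + " ".join(fields)


def write_map(tagged_path, map_path):
    """Read a tagged corpus file and write the corresponding map file."""
    with open(tagged_path, encoding="utf-8") as src, \
            open(map_path, "w", encoding="utf-8") as dst:
        for line in map_lines(count_word_tags(src)):
            dst.write(line + "\n")


# Example call (hypothetical file names):
# write_map("tagged.txt", "wtfile")
```

Since these numbers are p(POS | word) rather than p(word | POS), the 
resulting file would be used with the command from the original message 
plus the -scale option, i.e. disambig -text file -map wtfile -lm ttfile -scale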

I know that people have built POS taggers with SRILM.  I suggest that 
you direct further questions to the srilm-user mailing list.

Andreas

> Best Regards
> jianzhu
> 2008-08-14
>   




