Using and understanding an LM file (with modified Kneser-Ney smoothing)
Lane Schwartz
schwa717 at umn.edu
Wed Nov 28 15:15:00 PST 2007
Hi,
I'm working on some machine translation code in which I'd like to
incorporate a language model. I'm trying to replicate the system
described in David Chiang's 2005 ACL paper; in that paper, his
language model is a trigram model which uses modified Kneser-Ney
smoothing.
My goal is to train the LM using the SRILM toolkit, then use the
generated LM file in my own code.
I've looked over Chen & Goodman (1998), and I think I understand the
ideas, but I'm having some trouble understanding how to make sense of
the numbers in the LM file (produced by ngram-count).
Any help would be greatly appreciated.
My training corpus is the first 10000 lines of the English side of the
de-en Europarl training corpus
(http://www.cs.umn.edu/research/nlp/mt/wmt06/europarl.de-en.en.gz),
which I have lowercased and converted to UTF-8. Again, my goal is a
trigram language model with modified Kneser-Ney smoothing, and I want
to use interpolation - here's what I did to get the LM file:
$ zcat europarl.de-en.en.gz | head -n 10000 | \
    ngram-count -text - -order 3 -kndiscount -interpolate -lm sample.srilm
Since I'm trying to understand how to apply the ngram probabilities
and backoff-weights, I'm testing using a very simple test phrase:
$ echo "the man in" > sample.txt
Here are the (I think) relevant lines from the LM file:
unigrams:
-2.987062 </s>
-99 <s> -1.142606
-1.73375 in -0.660575
-3.960678 man -0.1932579
-1.781734 the -0.5241315
bigrams:
-0.8540089 <s> the -0.3293318
-1.516293 man in
-3.496579 the man -0.09554159
trigrams:
-0.6538057 the man in
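(In case it matters, here is roughly how I'm reading these entries into
my own code - just a Python sketch; the function name and dict layout
are my own choices, nothing SRILM-specific:)

import re

def load_arpa(path):
    logprob = {}   # n-gram tuple -> log10 probability (first column)
    backoff = {}   # n-gram tuple -> log10 backoff weight (trailing column, if any)
    order = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # skip blank lines and the \data\ header / ngram-count summary lines
            if not line or line.startswith("ngram ") or line in ("\\data\\", "\\end\\"):
                continue
            m = re.match(r"\\(\d+)-grams:", line)
            if m:
                order = int(m.group(1))   # entering the n-gram section of this order
                continue
            if order == 0:
                continue
            fields = line.split()
            ngram = tuple(fields[1:1 + order])
            logprob[ngram] = float(fields[0])
            if len(fields) > 1 + order:   # trailing backoff weight is optional
                backoff[ngram] = float(fields[-1])
    return logprob, backoff

With the file above, logprob[('the', 'man')] gives -3.496579 and
backoff[('the', 'man')] gives -0.09554159.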
I then ran the ngram tool to see what it does with this phrase:
$ ngram -lm sample.srilm -ppl sample.txt -debug 3
reading 10209 1-grams
reading 78195 2-grams
reading 20317 3-grams
the man in
p( the | <s> ) = [2gram] 0.139956 [ -0.854009 ] / 1
p( man | the ...) = [2gram] 0.00014931 [ -3.82591 ] / 1
p( in | man ...) = [3gram] 0.221919 [ -0.653806 ] / 1
p( </s> | in ...) = [1gram] 0.000225094 [ -3.64764 ] / 1
1 sentences, 3 words, 0 OOVs
0 zeroprobs, logprob= -8.98136 ppl= 175.93 ppl1= 985.797
file sample.txt: 1 sentences, 3 words, 0 OOVs
0 zeroprobs, logprob= -8.98136 ppl= 175.93 ppl1= 985.797
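(As far as I can tell, the ppl numbers follow directly from the logprob;
assuming ppl divides by words plus the sentence-end token and ppl1 by
words alone, I can reproduce them like this:)

logprob = -8.98136            # total log10 probability reported above
words, sentences = 3, 1
print(10 ** (-logprob / (words + sentences)))  # ~175.93  (ppl)
print(10 ** (-logprob / words))                # ~985.80  (ppl1)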
I'd like to make sense of the above numbers.
The first line, p( the | <s> ), makes sense, since the bigram log prob
for "<s> the" in sample.srilm is -0.8540089.
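(Quick check: 10**(-0.8540089) comes out to about 0.139956, which is
exactly the probability ngram prints on that line.)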
I'm getting stuck figuring out where -3.82591 comes from in
p( man | the ...). It seems that the formula should be:
interpolated P( man | the ) =
    lambda_man * P(man) + (1 - lambda_man) * (lambda_man|the * P(man|the))
If the weights listed above are the lambdas in that equation, that
gives us the following (converting from the log domain to the regular
domain as we go):
lambda_man     = 10**(-0.1932579)
P(man)         = 10**(-3.960678)
lambda_man|the = 10**(-0.09554159)
P(man|the)     = 10**(-3.496579)
So my interpolated P( man | the ) calculation gives 0.000162027. The
ngram util gave 0.00014931.
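Here is the literal calculation, in case I'm mis-describing it (just my
scratch Python, using the numbers from the LM file above):

lambda_man      = 10 ** -0.1932579    # backoff weight listed next to the unigram "man"
p_man           = 10 ** -3.960678     # unigram log10 prob of "man"
lambda_man_the  = 10 ** -0.09554159   # backoff weight listed next to the bigram "the man"
p_man_given_the = 10 ** -3.496579     # bigram log10 prob of "the man"

mine = lambda_man * p_man + (1 - lambda_man) * (lambda_man_the * p_man_given_the)
print(mine)        # ~0.000162027, my number
# ngram reports 0.00014931 for p( man | the ...)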
If anyone could help point out where I'm screwing up, it would be very
much appreciated. Am I running with the appropriate parameters to
ngram-count and ngram, given that I want an interpolated LM with
modified Kneser-Ney smoothing (as used by Chiang (2005))? Does my
equation above look right? I know this is a long email - thanks for
your time and thoughts.
Thanks,
Lane Schwartz
University of Minnesota