Using and understanding LM file (with modified Kneser-Ney smoothing)

Andreas Stolcke stolcke at speech.sri.com
Fri Nov 30 19:48:59 PST 2007


Lane,

there is a key misunderstanding here.  The interpolation of higher- and 
lower-order probability estimates (triggered by ngram-count 
-interpolate) happens at training time, and the final probability 
estimates are then stored in the LM file.  Hence, no interpolation is 
required at test time.
In fact, all LMs in ARPA backoff format are handled exactly the same in 
testing.  The different smoothing methods only come in during training.
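
Concretely, a test-time lookup just walks the backoff chain of the ARPA 
file: if the n-gram is listed, its log probability is used directly; 
otherwise the backoff weight of the context is added and the next-shorter 
history is looked up.  Here is a minimal sketch in Python (the 
dictionaries simply hold the log10 values from the LM file you quote 
below; this is an illustration, not SRILM's API):

logprob = {
    ("the",):             -1.781734,
    ("man",):             -3.960678,
    ("in",):              -1.73375,
    ("<s>", "the"):       -0.8540089,
    ("the", "man"):       -3.496579,
    ("man", "in"):        -1.516293,
    ("the", "man", "in"): -0.6538057,
}
bow = {
    ("man",):       -0.1932579,
    ("the",):       -0.5241315,
    ("the", "man"): -0.09554159,
    ("<s>", "the"): -0.3293318,
}

def log10prob(ngram):
    """Return log10 p(ngram[-1] | ngram[:-1]) by standard ARPA backoff."""
    if ngram in logprob:
        return logprob[ngram]
    # n-gram not listed: add the backoff weight of the context
    # (0 if the context has no entry) and back off to the (n-1)-gram
    return bow.get(ngram[:-1], 0.0) + log10prob(ngram[1:])

print(log10prob(("<s>", "the", "man")))   # -3.8259108

The trigram "<s> the man" is not in your LM, so the lookup backs off:
bow(<s> the) + log p(man | the) = -0.3293318 + -3.496579 = -3.82591, 
which is exactly the [2gram] value that ngram printed for p( man | the ...).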

I hope this answers your question.

Andreas

Lane Schwartz wrote:
> Hi,
>
> I'm working on some machine translation code in which I'd like to 
> incorporate a language model. I'm trying to replicate the system 
> described in David Chiang's 2005 ACL paper; in that paper, his 
> language model is a trigram model which uses modified Kneser-Ney 
> smoothing.
>
> My goal is to train the LM using the SRILM toolkit, then use the 
> generated LM file in my own code.
>
> I've looked over Chen & Goodman (1998), and I think I understand the 
> ideas, but I'm having some trouble understanding how to make sense of 
> the numbers in the LM file (produced by ngram-count).
>
> Any help would be greatly appreciated.
>
> My training corpus is the first 10000 lines of the English side of the 
> de-en Europarl training corpus 
> (http://www.cs.umn.edu/research/nlp/mt/wmt06/europarl.de-en.en.gz), 
> which I have lowercased and converted to UTF-8. Again, my goal is a 
> trigram language model which uses modified Kneser-Ney smoothing, and I 
> want to use interpolation - here's what I did to get the LM file:
>
> $ zcat europarl.de-en.en.gz | head -n 10000 | ngram-count -text - 
> -order 3 -kndiscount -interpolate -lm sample.srilm
>
> Since I'm trying to understand how to apply the ngram probabilities 
> and backoff-weights, I'm testing using a very simple test phrase:
>
> echo "the man in" > sample.txt
>
> Here are the (I think) relevant lines from the LM file:
>
> unigrams:
> -2.987062    </s>
> -99    <s>    -1.142606
> -1.73375    in    -0.660575
> -3.960678    man    -0.1932579
> -1.781734    the    -0.5241315
>
> bigrams:
> -0.8540089    <s> the    -0.3293318
> -1.516293    man in
> -3.496579    the man    -0.09554159
>
> trigrams:
> -0.6538057    the man in
>
>
>
> I then ran the ngram tool to see what it does with this phrase:
>
> $ ngram -lm sample.srilm -ppl sample.txt -debug 3
> reading 10209 1-grams
> reading 78195 2-grams
> reading 20317 3-grams
> the man in
>         p( the | <s> )  = [2gram] 0.139956 [ -0.854009 ] / 1
>         p( man | the ...)       = [2gram] 0.00014931 [ -3.82591 ] / 1
>         p( in | man ...)        = [3gram] 0.221919 [ -0.653806 ] / 1
>         p( </s> | in ...)       = [1gram] 0.000225094 [ -3.64764 ] / 1
> 1 sentences, 3 words, 0 OOVs
> 0 zeroprobs, logprob= -8.98136 ppl= 175.93 ppl1= 985.797
>
> file sample.txt: 1 sentences, 3 words, 0 OOVs
> 0 zeroprobs, logprob= -8.98136 ppl= 175.93 ppl1= 985.797
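>
> (As far as I can tell, the summary numbers follow directly from the 
> total logprob: ngram scores the 3 words plus </s>, so 
> ppl = 10**(8.98136/4) = 175.93, and ppl1 = 10**(8.98136/3) = 985.8, 
> which leaves the sentence-end token out of the denominator.)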
>
>
>
> I'd like to make sense of the above numbers.
>
> The first line, p( the | <s> ), makes sense, since the bigram log prob 
> for "<s> the" in lm.srilm is -0.8540089.
>
> I'm getting stuck figuring out where -3.82591 comes from in p( man | 
> the ...). It seems that the formula should be:
> interpolated P( man | the ) = lambda_man * P(man) + 
>     (1 - lambda_man) * (lambda_man|the * P(man|the))
>
> If the weights listed above are the lambdas in the above equation, that 
> gives us the following (converting from log domain to regular domain 
> as we go):
>
> lambda_man = 10**(-0.1932579)
> P(man) = 10**(-3.960678)
> lambda_man|the = 10**(-0.09554159)
> P(man|the) = 10**(-3.496579)
>
> So my interpolated P( man | the ) calculation gives 0.000162027. The 
> ngram util gave 0.00014931.
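>
> Spelled out in Python, that hand calculation is (using the backoff 
> weights from the LM file as the lambdas):
>
> lambda_man      = 10 ** -0.1932579    # weight listed next to "man"
> p_man           = 10 ** -3.960678     # unigram prob of "man"
> lambda_man_the  = 10 ** -0.09554159   # weight listed next to "the man"
> p_man_given_the = 10 ** -3.496579     # bigram prob of "the man"
>
> p_interp = lambda_man * p_man \
>            + (1 - lambda_man) * (lambda_man_the * p_man_given_the)
> # comes out around 0.000162, not the 0.00014931 that ngram reports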
>
>
> If anyone could help point out where I'm screwing up, it would be very 
> much appreciated. Am I running with the appropriate parameters to 
> ngram-count and ngram, given that I want an interpolated LM with 
> modified Kneser-Ney smoothing (as used by Chiang (2005))? Does my 
> equation above look right? I know this is a long email - thanks for 
> your time and thoughts.
>
> Thanks,
> Lane Schwartz
>
> University of Minnesota
>
