Using and understanding an LM file (with modified Kneser-Ney smoothing)
Lane Schwartz
schwa717 at umn.edu
Wed Nov 28 15:15:00 PST 2007
Hi,
I'm working on some machine translation code in which I'd like to
incorporate a language model. I'm trying to replicate the system
described in David Chiang's 2005 ACL paper; in that paper, his
language model is a trigram model which uses modified Kneser-Ney
smoothing.
My goal is to train the LM using the SRILM toolkit, then use the
generated LM file in my own code.
I've looked over Chen & Goodman (1998), and I think I understand the
ideas, but I'm having some trouble understanding how to make sense of
the numbers in the LM file (produced by ngram-count).
Any help would be greatly appreciated.
My training corpus is the first 10000 lines of the English side of the
de-en Europarl training corpus
(http://www.cs.umn.edu/research/nlp/mt/wmt06/europarl.de-en.en.gz),
which I have lowercased and converted to UTF-8. Again, my goal is a
trigram language model with modified Kneser-Ney smoothing, and I want
to use interpolation - here's what I did to get the LM file:
$ zcat europarl.de-en.en.gz | head -n 10000 | \
    ngram-count -text - -order 3 -kndiscount -interpolate -lm sample.srilm
Since I'm trying to understand how to apply the ngram probabilities
and backoff-weights, I'm testing using a very simple test phrase:
$ echo "the man in" > sample.txt
Here are the (I think) relevant lines from the LM file:
unigrams:
-2.987062 </s>
-99 <s> -1.142606
-1.73375 in -0.660575
-3.960678 man -0.1932579
-1.781734 the -0.5241315
bigrams:
-0.8540089 <s> the -0.3293318
-1.516293 man in
-3.496579 the man -0.09554159
trigrams:
-0.6538057 the man in
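(In case it matters, here is roughly how I'm reading these entries into
my own code - just a Python sketch; the function name and dict layout
are my own choices, nothing SRILM-specific:)

import re

def load_arpa(path):
    logprob = {}   # n-gram tuple -> log10 probability (first column)
    backoff = {}   # n-gram tuple -> log10 backoff weight (trailing column, if any)
    order = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            # skip blank lines and the \data\ header / ngram-count summary lines
            if not line or line.startswith("ngram ") or line in ("\\data\\", "\\end\\"):
                continue
            m = re.match(r"\\(\d+)-grams:", line)
            if m:
                order = int(m.group(1))   # entering the n-gram section of this order
                continue
            if order == 0:
                continue
            fields = line.split()
            ngram = tuple(fields[1:1 + order])
            logprob[ngram] = float(fields[0])
            if len(fields) > 1 + order:   # trailing backoff weight is optional
                backoff[ngram] = float(fields[-1])
    return logprob, backoff

With the file above, logprob[('the', 'man')] gives -3.496579 and
backoff[('the', 'man')] gives -0.09554159.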
I then ran the ngram tool to see what it does with this phrase:
$ ngram -lm sample.srilm -ppl sample.txt -debug 3
reading 10209 1-grams
reading 78195 2-grams
reading 20317 3-grams
the man in
p( the | <s> ) = [2gram] 0.139956 [ -0.854009 ] / 1
p( man | the ...) = [2gram] 0.00014931 [ -3.82591 ] / 1
p( in | man ...) = [3gram] 0.221919 [ -0.653806 ] / 1
p( </s> | in ...) = [1gram] 0.000225094 [ -3.64764 ] / 1
1 sentences, 3 words, 0 OOVs
0 zeroprobs, logprob= -8.98136 ppl= 175.93 ppl1= 985.797
file sample.txt: 1 sentences, 3 words, 0 OOVs
0 zeroprobs, logprob= -8.98136 ppl= 175.93 ppl1= 985.797
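(As far as I can tell, the ppl numbers follow directly from the logprob;
assuming ppl divides by words plus the sentence-end token and ppl1 by
words alone, I can reproduce them like this:)

logprob = -8.98136            # total log10 probability reported above
words, sentences = 3, 1
print(10 ** (-logprob / (words + sentences)))  # ~175.93  (ppl)
print(10 ** (-logprob / words))                # ~985.80  (ppl1)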
I'd like to make sense of the above numbers.
The first line, p( the | <s> ), makes sense, since the bigram log prob
for "<s> the" in sample.srilm is -0.8540089.
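(Quick check: 10**(-0.8540089) comes out to about 0.139956, which is
exactly the probability ngram prints on that line.)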
I'm getting stuck figuring out where -3.82591 comes from in
p( man | the ...). It seems that the formula should be:
interpolated P( man | the ) =
    lambda_man * P(man) + (1 - lambda_man) * (lambda_man|the * P(man|the))
If the weights listed above are the lambdas in that equation, that
gives us the following (converting from the log domain to the regular
domain as we go):
lambda_man     = 10**(-0.1932579)
P(man)         = 10**(-3.960678)
lambda_man|the = 10**(-0.09554159)
P(man|the)     = 10**(-3.496579)
So my interpolated P( man | the ) calculation gives 0.000162027. The
ngram util gave 0.00014931.
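Here is the literal calculation, in case I'm mis-describing it (just my
scratch Python, using the numbers from the LM file above):

lambda_man      = 10 ** -0.1932579    # backoff weight listed next to the unigram "man"
p_man           = 10 ** -3.960678     # unigram log10 prob of "man"
lambda_man_the  = 10 ** -0.09554159   # backoff weight listed next to the bigram "the man"
p_man_given_the = 10 ** -3.496579     # bigram log10 prob of "the man"

mine = lambda_man * p_man + (1 - lambda_man) * (lambda_man_the * p_man_given_the)
print(mine)        # ~0.000162027, my number
# ngram reports 0.00014931 for p( man | the ...)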
If anyone could help point out where I'm screwing up, it would be very
much appreciated. Am I running with the appropriate parameters to
ngram-count and ngram, given that I want an interpolated LM with
modified Kneser-Ney smoothing (as used by Chiang (2005))? Does my
equation above look right? I know this is a long email - thanks for
your time and thoughts.
Thanks,
Lane Schwartz
University of Minnesota