Understanding lm-files and discounting

Deniz Yuret dyuret at ku.edu.tr
Mon Dec 3 02:38:29 PST 2007


I spent last weekend trying to figure out the discrepancies between the
SRILM kn-discounting implementations and my earlier implementations.
Basically I am trying to trace the path from the text file to the count
file to the model file to the probabilities assigned to the words in the
test file.  This took me on a journey from man pages to debug outputs to
the source code.  I figured a lot of it out, but it turned out to be
nontrivial to get from the paper descriptions to the numbers in the ARPA
ngram format to the final probability calculations.  If you help me with
a couple of things, I promise I'll write a man page detailing all the
discounting calculations in SRILM.
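
For concreteness, the pipeline I have in mind is roughly the following
(just a sketch: the file names are made up, and Python here is only a
convenient wrapper around the commands, which I believe are the
standard ones):

    import subprocess

    order = "7"
    # text file -> count file
    subprocess.run(["ngram-count", "-order", order, "-text", "train.txt",
                    "-write", "counts.txt"], check=True)
    # count file -> ARPA model file with Kneser-Ney discounting
    subprocess.run(["ngram-count", "-order", order, "-read", "counts.txt",
                    "-kndiscount", "-interpolate", "-lm", "model.arpa"], check=True)
    # model file -> per-word probabilities on the test file
    # (this is the command that produced the debug output quoted below)
    subprocess.run(["ngram", "-order", order, "-lm", "model.arpa",
                    "-ppl", "test.txt", "-debug", "2"], check=True)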

1. Sometimes the model seems to use lower-order ngrams even when longer
ones are in the training file.  An example from a letter model:

E i s e n h o w e r
       p( E | <s> )    = [2gram] 0.0122983 [ -1.91016 ] / 1
       p( i | E ...)   = [3gram] 0.0143471 [ -1.84324 ] / 1
       p( s | i ...)   = [4gram] 0.308413 [ -0.510867 ] / 1
       p( e | s ...)   = [5gram] 0.412852 [ -0.384206 ] / 1
       p( n | e ...)   = [6gram] 0.759049 [ -0.11973 ] / 1
       p( h | n ...)   = [7gram] 0.397406 [ -0.400766 ] / 1
       p( o | h ...)   = [4gram] 0.212227 [ -0.6732 ] / 1
       p( w | o ...)   = [3gram] 0.0199764 [ -1.69948 ] / 1
       p( e | w ...)   = [4gram] 0.165049 [ -0.782387 ] / 1
       p( r | e ...)   = [4gram] 0.222122 [ -0.653408 ] / 1
       p( </s> | r ...)        = [5gram] 0.492478 [ -0.307613 ] / 1
1 sentences, 10 words, 0 OOVs
0 zeroprobs, logprob= -9.28505 ppl= 6.98386 ppl1= 8.48213

This is an -order 7 model, and the training file does contain the word
Eisenhower, so I don't understand why it falls back to lower-order
ngrams after the letter 'h'.

2. Why don't all (n-1)-grams have backoff weights in the model file?
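
For (1) and (2), my current understanding of how an ARPA model is used
at test time is sketched below: only ngrams that made it into the model
file can be matched, the longest stored match is used, and a context
carries an explicit backoff weight only if it is the prefix of some
stored higher-order ngram (a missing weight meaning log10 bow = 0).
Please correct me if this sketch is off; prob and bow here are just
made-up dictionaries holding the two columns of the ARPA file, all in
log10:

    def logprob(word, context, prob, bow):
        # prob[(context, word)] = log10 probability from the ARPA file
        # bow[context]          = log10 backoff weight from the ARPA file
        # context is a tuple of the preceding words, most distant first
        if (context, word) in prob:          # longest stored ngram wins
            return prob[(context, word)]
        assert context, "word is not even in the unigram section (OOV)"
        # contexts without a stored weight implicitly back off with weight 1
        return bow.get(context, 0.0) + logprob(word, context[1:], prob, bow)

If this is roughly right, then question (1) reduces to asking why those
longer ngrams from Eisenhower never made it into the model file in the
first place.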

3. What exactly does SRILM do with the Google ngrams?  Can you give an
example usage?  Does it, for example, extract a small subset that is
useful for evaluating a test file?

4. Since the Google ngrams are missing all ngrams with count below 40,
the KN discount constants that rely on the number of ngrams with low
counts cannot be estimated.  Also, I found empirically that the best
highest-order discount constant is close to 40, not in the [0,1] range.
How does SRILM handle this?
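
For reference, the constants I mean are (as far as I can tell) the
modified Kneser-Ney discounts of Chen & Goodman that -kndiscount
estimates from the counts-of-counts n1..n4; -ukndiscount uses the
single discount D = n1/(n1+2*n2) instead.  Either way everything is
driven by the number of low-count ngrams, which the cutoff at 40 wipes
out:

    def mkn_discounts(n1, n2, n3, n4):
        # n_k = number of ngrams of a given order occurring exactly k times
        Y = n1 / (n1 + 2.0 * n2)
        D1 = 1 - 2 * Y * n2 / n1      # discount for ngrams seen once
        D2 = 2 - 3 * Y * n3 / n2      # discount for ngrams seen twice
        D3plus = 3 - 4 * Y * n4 / n3  # discount for ngrams seen 3+ times
        return D1, D2, D3plus

With the Google data n1 = n2 = n3 = n4 = 0, so these formulas divide by
zero before any discounting can even start.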

5. Do I need to understand the following messages in order to follow
the calculations?
warning: 7.65818e-10 backoff probability mass left for "" --
incrementing denominator
warning: distributing 0.000254455 left-over probability mass over all 124 words
discarded 254764 7-gram probs discounted to zero
inserted 2766 redundant 3-gram probs

best,
deniz


