Understanding lm-files and discounting

Mon Dec 3 22:07:11 PST 2007

In message <cea871f80712030238r148279bdsf4664161e710a2a2 at mail.gmail.com>you wro
te:
> I spent last weekend trying to figure out the discrepancies between the
> SRILM kn-discounting implementations and my earlier implementations.
> Basically I am trying to go from the text file to the count file to
> the model file
> to the probabilities assigned to the words in the test file.  This took me on
>  a
> journey from man pages to debug outputs to the source code.  I figured
> a lot of it out but it turned out to be nontrivial to go from paper
> descriptions to the numbers in the ARPA ngram format to the final
> probability calculations.  If you help me with a couple of things I
> promise I'll write a man page detailing all discounting calculations
> in SRILM.

A tutorial or FAQ including the information below would be most useful!

> 
> 1. Sometimes the model seems to use smaller ngrams even when longer
> ones are in the training file.  An example from a letter model:
> 
> E i s e n h o w e r
>        p( E | <s> )    = [2gram] 0.0122983 [ -1.91016 ] / 1
>        p( i | E ...)   = [3gram] 0.0143471 [ -1.84324 ] / 1
>        p( s | i ...)   = [4gram] 0.308413 [ -0.510867 ] / 1
>        p( e | s ...)   = [5gram] 0.412852 [ -0.384206 ] / 1
>        p( n | e ...)   = [6gram] 0.759049 [ -0.11973 ] / 1
>        p( h | n ...)   = [7gram] 0.397406 [ -0.400766 ] / 1
>        p( o | h ...)   = [4gram] 0.212227 [ -0.6732 ] / 1
>        p( w | o ...)   = [3gram] 0.0199764 [ -1.69948 ] / 1
>        p( e | w ...)   = [4gram] 0.165049 [ -0.782387 ] / 1
>        p( r | e ...)   = [4gram] 0.222122 [ -0.653408 ] / 1
>        p( </s> | r ...)        = [5gram] 0.492478 [ -0.307613 ] / 1
> 1 sentences, 10 words, 0 OOVs
> 0 zeroprobs, logprob= -9.28505 ppl= 6.98386 ppl1= 8.48213
> 
> This is an -order 7 model and the training file does have the word
> Eisenhower.  So I don't understand why it goes back to using lower
> order ngrams after the letter 'h'.

This is because the default "mincount" for N-grams longer than 2 words is 2,
Meaning that a trigram, 4gram, etc. has to occur at least twice to be included
in the LM.
You can change this with the options

	-gt3min 1
	-gt4min 1
	etc.

> 
> 2. Not all (n-1)-grams have backoff weights in the model file, why?

Backoff weights are only recorded for N-grams that appear as the prefix 
of a longer N-gram.  For all others the backoff weight is implicitly 1
(or 0, in log representation).  This convention saves a lot of space.

> 
> 3. What exactly does srilm do with google ngrams?  Can you give an
> example usage?  Does it do things like extract a small subset useful
> for evaluating a test file?

Google n-grams are not an LM format, they are way to store N-gram counts
on disk, and the classes that implement N-gram counts know how to read them.
This is exercized by the ngram-count -read-google option.
However, due to their typical size it is not advisable to try to build 
backoff LMs of the standard sort, which would require reading all N-grams 
into memory (someone working at Google might actually be able to do this
if their hardware budget is as phenomenal as it must be).

Instead, I recommend estimating a deleted-interpolation-smoothed
"count LM", i.e, an LM that consists of only a small number of 
interpolation weights (for smoothing) as well as the raw N-gram counts
themselves.  This way we can in fact load only the portion of the counts
into memory that impinge on a given test set (triggered by the 
ngram -limit-vocab option).

There is no full example of this, but it is basically what you see in 
$SRILM/test/tests/ngram-count-lm-limit-vocab .  The only change would be 
that instead of a countlm file with the keyword "counts" you would
use the keyword "google-counts" followed by the path to the google count
directory root.  Read the man page sections for ngram-count -count-lm and 
ngram -count-lm  for more information, and follow the example under the test
directory.

> 
> 4. Since google-ngrams have all ngrams below count=40 missing, the kn
> discount constants that rely on the number of ngrams with low counts
> will fail.  Also I found that empirically the best highest order
> discount constant is close to 40, not in the [0,1] range.  How does
> srilm handle this?

The deleted interpolation method of smoothing I am recommending above does
not have a problem with the missing ngrams.

There is also a way to extrapolate from the available counts-of-counts above
some threshold to those below the threshold, due to an empirical law that
we found to hold for a range of corpora.  For details see the paper

W. Wang, A. Stolcke, & J. Zheng (2007), Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. To appear in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto. 
http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz

The extrapolation method is implemented in the script
$SRILM/utils/src/make-kn-discounts.gawk and is automatically invoked if you use 
make-big-lm to build your LM.   Again, it is not feasible to do this on
the ngrams distributed by Google.

> 5. Do I need to understand what the following messages mean to
> understand the calculations:

Not really, they are for information only.

> warning: 7.65818e-10 backoff probability mass left for "" -- incrementing denominator

This means your unigram probabilities even after discounting sum to (almost) 1.
As a crude fallback, the denominator in the estimator is incremented to yield
usable backoff probability mass.

> warning: distributing 0.000254455 left-over probability mass over all 124 wor
> ds

Here the backof mass is 0.000254455 and is spread out over the 124 words that 
don't have any observed occurrences.

> discarded 254764 7-gram probs discounted to zero

Due to discounting cutoff (mincounts, see above) some 7-grams were not
included in the model.

> inserted 2766 redundant 3-gram probs

The ARPA format requires all prefixes of ngrams with probabilities to 
also have probabilities.  E.g., if "a b c" is in the model, so must "a b",
even if "a b" was not in the input ngram counts.  In such cases SRILM will
insert the "a b" probability but make it equal to what the backoff computation
would yield.

Andreas