Understanding lm-files and discounting

Deniz Yuret dyuret at ku.edu.tr
Mon Dec 10 07:41:16 PST 2007


Working on that documentation as promised.  Small question about the
mincounts: I was able to verify what you said with the default (gt)
discount, but with kndiscount or ukndiscount some long ngrams with
count=1 are included in the model.  Since the counts are modified under
KN, I thought maybe the cutoff is applied to the unmodified counts, but
then there are some ngrams excluded whose regular count is > 1 and whose
KN count is 1.  So I couldn't quite figure out which subset is included
in the model with KN discounting.

deniz



On Dec 4, 2007 8:07 AM, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>
> In message <cea871f80712030238r148279bdsf4664161e710a2a2 at mail.gmail.com> you wrote:
> > I spent last weekend trying to figure out the discrepancies between the
> > SRILM kn-discounting implementations and my earlier implementations.
> > Basically I am trying to go from the text file to the count file to
> > the model file
> > to the probabilities assigned to the words in the test file.  This took
> > me on a journey from man pages to debug outputs to the source code.  I figured
> > a lot of it out but it turned out to be nontrivial to go from paper
> > descriptions to the numbers in the ARPA ngram format to the final
> > probability calculations.  If you help me with a couple of things I
> > promise I'll write a man page detailing all discounting calculations
> > in SRILM.
>
> A tutorial or FAQ including the information below would be most useful!
>
> >
> > 1. Sometimes the model seems to use smaller ngrams even when longer
> > ones are in the training file.  An example from a letter model:
> >
> > E i s e n h o w e r
> >        p( E | <s> )    = [2gram] 0.0122983 [ -1.91016 ] / 1
> >        p( i | E ...)   = [3gram] 0.0143471 [ -1.84324 ] / 1
> >        p( s | i ...)   = [4gram] 0.308413 [ -0.510867 ] / 1
> >        p( e | s ...)   = [5gram] 0.412852 [ -0.384206 ] / 1
> >        p( n | e ...)   = [6gram] 0.759049 [ -0.11973 ] / 1
> >        p( h | n ...)   = [7gram] 0.397406 [ -0.400766 ] / 1
> >        p( o | h ...)   = [4gram] 0.212227 [ -0.6732 ] / 1
> >        p( w | o ...)   = [3gram] 0.0199764 [ -1.69948 ] / 1
> >        p( e | w ...)   = [4gram] 0.165049 [ -0.782387 ] / 1
> >        p( r | e ...)   = [4gram] 0.222122 [ -0.653408 ] / 1
> >        p( </s> | r ...)        = [5gram] 0.492478 [ -0.307613 ] / 1
> > 1 sentences, 10 words, 0 OOVs
> > 0 zeroprobs, logprob= -9.28505 ppl= 6.98386 ppl1= 8.48213
> >
> > This is an -order 7 model and the training file does have the word
> > Eisenhower.  So I don't understand why it goes back to using lower
> > order ngrams after the letter 'h'.
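> >
> > For reference, the per-word trace above is the kind of output ngram produces
> > with -debug 2; the command looks something like this (the model and test
> > file names here are placeholders):
> >
> >         # print, for each word, the n-gram order and probability used
> >         ngram -order 7 -lm letters.lm -ppl test.txt -debug 2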
>
> This is because the default "mincount" for N-grams longer than 2 words is 2,
> meaning that a trigram, 4-gram, etc. has to occur at least twice to be
> included in the LM.
> You can change this with the options
>
>         -gt3min 1
>         -gt4min 1
>         etc.
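>
> For example, to keep singleton higher-order N-grams in a 7-gram letter model,
> something along these lines (the file names are placeholders):
>
>         ngram-count -order 7 -text train.txt -lm letters.lm \
>                 -gt3min 1 -gt4min 1 -gt5min 1 -gt6min 1 -gt7min 1
>
> will retain 3-grams through 7-grams that occur only once.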
>
>
> >
> > 2. Not all (n-1)-grams have backoff weights in the model file, why?
>
> Backoff weights are only recorded for N-grams that appear as the prefix
> of a longer N-gram.  For all others the backoff weight is implicitly 1
> (or 0, in log representation).  This convention saves a lot of space.
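>
> In the ARPA file that looks like this (the numbers here are made up for
> illustration):
>
>         \2-grams:
>         -1.91016        <s> E   -0.30103
>         -2.50000        E z
>
> The "<s> E" entry carries an explicit log backoff weight (-0.30103) because
> some longer N-gram starts with "<s> E"; the "E z" entry has no third column,
> so its backoff weight is implicitly log 1 = 0.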
>
> >
> > 3. What exactly does srilm do with google ngrams?  Can you give an
> > example usage?  Does it do things like extract a small subset useful
> > for evaluating a test file?
>
> Google n-grams are not an LM format; they are a way to store N-gram counts
> on disk, and the classes that implement N-gram counts know how to read them.
> This is exercised by the ngram-count -read-google option.
> However, due to their typical size it is not advisable to try to build
> backoff LMs of the standard sort, which would require reading all N-grams
> into memory (someone working at Google might actually be able to do this
> if their hardware budget is as phenomenal as it must be).
>
> Instead, I recommend estimating a deleted-interpolation-smoothed
> "count LM", i.e, an LM that consists of only a small number of
> interpolation weights (for smoothing) as well as the raw N-gram counts
> themselves.  This way we can in fact load only the portion of the counts
> into memory that impinge on a given test set (triggered by the
> ngram -limit-vocab option).
>
> There is no full example of this, but it is basically what you see in
> $SRILM/test/tests/ngram-count-lm-limit-vocab .  The only change would be
> that instead of a countlm file with the keyword "counts" you would
> use the keyword "google-counts" followed by the path to the google count
> directory root.  Read the man page sections for ngram-count -count-lm and
> ngram -count-lm  for more information, and follow the example under the test
> directory.
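>
> The evaluation step would then look something like this (the countlm, vocab,
> and test file names are placeholders):
>
>         ngram -count-lm -lm google.countlm \
>                 -vocab test.vocab -limit-vocab -ppl test.txt
>
> where google.countlm holds the interpolation weights and a google-counts
> line pointing at the count directory root, as described above.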
>
> >
> > 4. Since google-ngrams have all ngrams below count=40 missing, the kn
> > discount constants that rely on the number of ngrams with low counts
> > will fail.  Also, I found empirically that the best highest-order
> > discount constant is close to 40, not in the [0,1] range.  How does
> > srilm handle this?
>
> The deleted interpolation method of smoothing I am recommending above does
> not have a problem with the missing ngrams.
>
> There is also a way to extrapolate from the available counts-of-counts above
> some threshold to those below the threshold, due to an empirical law that
> we found to hold for a range of corpora.  For details see the paper
>
> W. Wang, A. Stolcke, & J. Zheng (2007), Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. To appear in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto.
> http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz
>
> The extrapolation method is implemented in the script
> $SRILM/utils/src/make-kn-discounts.gawk and is automatically invoked if you use
> make-big-lm to build your LM.   Again, it is not feasible to do this on
> the ngrams distributed by Google.
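>
> For reference, a typical make-big-lm invocation looks something like this
> (the file names are placeholders):
>
>         make-big-lm -name biglm -read counts.gz \
>                 -order 5 -kndiscount -interpolate -lm big.lm
>
> make-big-lm gathers the counts-of-counts, runs make-kn-discounts.gawk to
> extrapolate the discounting parameters, and passes the results on to
> ngram-count.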
>
> > 5. Do I need to understand what the following messages mean to
> > understand the calculations:
>
> Not really, they are for information only.
>
> > warning: 7.65818e-10 backoff probability mass left for "" -- incrementing denominator
>
> This means that your unigram probabilities, even after discounting, sum to
> (almost) 1, leaving essentially no probability mass for backoff.  As a crude
> fallback, the denominator in the estimator is incremented to yield a usable
> amount of backoff probability mass.
>
> > warning: distributing 0.000254455 left-over probability mass over all 124 words
>
> Here the backoff mass is 0.000254455 and is spread out over the 124 words that
> don't have any observed occurrences, i.e., each of those words receives
> 0.000254455 / 124, or about 2.05e-06.
>
> > discarded 254764 7-gram probs discounted to zero
>
> Due to the discounting cutoffs (mincounts, see above), some 7-grams were not
> included in the model.
>
> > inserted 2766 redundant 3-gram probs
>
> The ARPA format requires that all prefixes of ngrams with probabilities also
> have probabilities.  E.g., if "a b c" is in the model, then "a b" must be as
> well, even if "a b" was not in the input ngram counts.  In such cases SRILM
> inserts an "a b" probability equal to what the backoff computation would
> yield, i.e., p(b | a) = bow(a) * p(b).
>
> Andreas
>
>


