Understanding lm-files and discounting

Andreas Stolcke stolcke at speech.sri.com
Mon Dec 10 13:51:51 PST 2007


Deniz Yuret wrote:
> Working on that documentation as promised.  Small question about the
> mincounts: I was able to verify what you said with the default (gt)
> discount, but with kn or ukndiscount some long ngrams with cnt=1 are
> included in the model.  Since the counts are modified, I thought maybe
> it was looking at unmodified counts, but then there are some ngrams
> excluded with regular count > 1 and kncount = 1.  So I couldn't quite
> figure out which subset is included in the model with KN discounting.
>   
I think what you're seeing can be explained by the following two facts:

1 - With KN discounting, the mincounts are indeed applied to the modified 
lower-order counts.
2 - However (and this is true for all smoothing methods), if an ngram 
"a b c d" is included in the model based on its counts, then all prefixes 
of that ngram also need to be included (otherwise you'd have an empty 
first column in the lm file at those prefix ngrams, which is illegal).
So, if the mincount is 2 for 4grams and 3grams, and "a b c d" occurs twice, 
but (after count modification) "a b c" occurs only once, then "a b c" would 
still be included in the LM.
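
A minimal Python sketch (illustrative only, not the actual SRILM code, with
hypothetical helper names) of the two facts above: the modified lower-order
KN count of an ngram is the number of distinct words preceding it, and any
ngram kept in the model forces its prefixes to be kept as well.

    from collections import defaultdict

    def modified_lower_order_counts(higher_order_counts):
        """Modified KN count of an n-gram = number of distinct words that
        precede it among the (n+1)-gram types."""
        modified = defaultdict(int)
        for ngram in higher_order_counts:      # ngram is a tuple of words
            modified[ngram[1:]] += 1           # each type adds one left context
        return modified

    def apply_mincount(counts, mincount, kept_higher_order):
        """Keep n-grams meeting the mincount, then force in every prefix of a
        longer n-gram that was already kept (required by the ARPA format)."""
        kept = {g for g, c in counts.items() if c >= mincount}
        kept |= {g[:-1] for g in kept_higher_order}
        return kept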

See if the above is in agreement with your observations.

Andreas

> deniz
>
>
>
> On Dec 4, 2007 8:07 AM, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>   
>> In message <cea871f80712030238r148279bdsf4664161e710a2a2 at mail.gmail.com> you wrote:
>>     
>>> I spent last weekend trying to figure out the discrepancies between the
>>> SRILM kn-discounting implementations and my earlier implementations.
>>> Basically I am trying to go from the text file to the count file to
>>> the model file
>>> to the probabilities assigned to the words in the test file.  This took me on a
>>> journey from man pages to debug outputs to the source code.  I figured
>>> a lot of it out but it turned out to be nontrivial to go from paper
>>> descriptions to the numbers in the ARPA ngram format to the final
>>> probability calculations.  If you help me with a couple of things I
>>> promise I'll write a man page detailing all discounting calculations
>>> in SRILM.
>>>       
>> A tutorial or FAQ including the information below would be most useful!
>>
>>     
>>> 1. Sometimes the model seems to use smaller ngrams even when longer
>>> ones are in the training file.  An example from a letter model:
>>>
>>> E i s e n h o w e r
>>>        p( E | <s> )    = [2gram] 0.0122983 [ -1.91016 ] / 1
>>>        p( i | E ...)   = [3gram] 0.0143471 [ -1.84324 ] / 1
>>>        p( s | i ...)   = [4gram] 0.308413 [ -0.510867 ] / 1
>>>        p( e | s ...)   = [5gram] 0.412852 [ -0.384206 ] / 1
>>>        p( n | e ...)   = [6gram] 0.759049 [ -0.11973 ] / 1
>>>        p( h | n ...)   = [7gram] 0.397406 [ -0.400766 ] / 1
>>>        p( o | h ...)   = [4gram] 0.212227 [ -0.6732 ] / 1
>>>        p( w | o ...)   = [3gram] 0.0199764 [ -1.69948 ] / 1
>>>        p( e | w ...)   = [4gram] 0.165049 [ -0.782387 ] / 1
>>>        p( r | e ...)   = [4gram] 0.222122 [ -0.653408 ] / 1
>>>        p( </s> | r ...)        = [5gram] 0.492478 [ -0.307613 ] / 1
>>> 1 sentences, 10 words, 0 OOVs
>>> 0 zeroprobs, logprob= -9.28505 ppl= 6.98386 ppl1= 8.48213
>>>
>>> This is an -order 7 model and the training file does have the word
>>> Eisenhower.  So I don't understand why it goes back to using lower
>>> order ngrams after the letter 'h'.
>>>       
>> This is because the default "mincount" for N-grams longer than 2 words is 2,
>> meaning that a trigram, 4gram, etc. has to occur at least twice to be included
>> in the LM.
>> You can change this with the options
>>
>>         -gt3min 1
>>         -gt4min 1
>>         etc.
>>
>>
>>     
>>> 2. Not all (n-1)-grams have backoff weights in the model file, why?
>>>       
>> Backoff weights are only recorded for N-grams that appear as the prefix
>> of a longer N-gram.  For all others the backoff weight is implicitly 1
>> (or 0, in log representation).  This convention saves a lot of space.
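>>
>> (Illustrative only, not SRILM code: a reader of the ARPA file can apply
>> this convention with a lookup like the following, using log10 weights.)
>>
>>     def backoff_weight(bows, context):
>>         # bows maps context tuples to log10 backoff weights; contexts that
>>         # never start a longer N-gram are absent, i.e. implicitly log10(1) = 0
>>         return bows.get(context, 0.0)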
>>
>>     
>>> 3. What exactly does srilm do with google ngrams?  Can you give an
>>> example usage?  Does it do things like extract a small subset useful
>>> for evaluating a test file?
>>>       
>> Google n-grams are not an LM format; they are a way to store N-gram counts
>> on disk, and the classes that implement N-gram counts know how to read them.
>> This is exercised by the ngram-count -read-google option.
>> However, due to their typical size it is not advisable to try to build
>> backoff LMs of the standard sort, which would require reading all N-grams
>> into memory (someone working at Google might actually be able to do this
>> if their hardware budget is as phenomenal as it must be).
>>
>> Instead, I recommend estimating a deleted-interpolation-smoothed
>> "count LM", i.e., an LM that consists of only a small number of
>> interpolation weights (for smoothing) as well as the raw N-gram counts
>> themselves.  This way we can in fact load only the portion of the counts
>> into memory that impinges on a given test set (triggered by the
>> ngram -limit-vocab option).
>>
>> There is no full example of this, but it is basically what you see in
>> $SRILM/test/tests/ngram-count-lm-limit-vocab .  The only change would be
>> that instead of a countlm file with the keyword "counts" you would
>> use the keyword "google-counts" followed by the path to the google count
>> directory root.  Read the man page sections for ngram-count -count-lm and
>> ngram -count-lm  for more information, and follow the example under the test
>> directory.
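>>
>> (Just for the intuition, a rough Python sketch of what a deleted-interpolation
>> count LM computes: probabilities come directly from raw counts plus one
>> interpolation weight per order. The weights and recursion here are schematic,
>> not the exact SRILM formula.)
>>
>>     def count_lm_prob(word, context, counts, lambdas, vocab_size):
>>         # counts[k]: dict mapping k-gram tuples to raw counts
>>         # lambdas[k]: interpolation weight for order k (tuned by deleted interpolation)
>>         if not context:
>>             return 1.0 / vocab_size                     # uniform base case
>>         order = len(context) + 1
>>         hist = counts[order - 1].get(context, 0)
>>         ml = counts[order].get(context + (word,), 0) / hist if hist else 0.0
>>         lower = count_lm_prob(word, context[1:], counts, lambdas, vocab_size)
>>         return lambdas[order] * ml + (1.0 - lambdas[order]) * lower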
>>
>>     
>>> 4. Since google-ngrams have all ngrams below count=40 missing, the kn
>>> discount constants that rely on the number of ngrams with low counts
>>> will fail.  Also I found that empirically the best highest order
>>> discount constant is close to 40, not in the [0,1] range.  How does
>>> srilm handle this?
>>>       
>> The deleted interpolation method of smoothing I am recommending above does
>> not have a problem with the missing ngrams.
>>
>> There is also a way to extrapolate from the available counts-of-counts above
>> some threshold to those below the threshold, due to an empirical law that
>> we found to hold for a range of corpora.  For details see the paper
>>
>> W. Wang, A. Stolcke, & J. Zheng (2007), Reranking Machine Translation Hypotheses With Structured and Web-based Language Models. To appear in Proc. IEEE Automatic Speech Recognition and Understanding Workshop, Kyoto.
>> http://www.speech.sri.com/cgi-bin/run-distill?papers/asru2007-mt-lm.ps.gz
>>
>> The extrapolation method is implemented in the script
>> $SRILM/utils/src/make-kn-discounts.gawk and is automatically invoked if you use
>> make-big-lm to build your LM.   Again, it is not feasible to do this on
>> the ngrams distributed by Google.
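>>
>> (For reference, the standard Kneser-Ney discount constant is estimated from
>> the counts-of-counts, roughly as in the sketch below; this is exactly what
>> breaks when everything below count 40 is missing, and what the extrapolation
>> in make-kn-discounts.gawk works around.)
>>
>>     def kn_discount(count_of_counts):
>>         # Ney-style estimate D = n1 / (n1 + 2*n2), where n1 and n2 are the
>>         # numbers of n-grams occurring exactly once and exactly twice
>>         n1 = count_of_counts.get(1, 0)
>>         n2 = count_of_counts.get(2, 0)
>>         if n1 == 0:
>>             raise ValueError("no singleton count-of-count available")
>>         return n1 / float(n1 + 2 * n2)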
>>
>>     
>>> 5. Do I need to understand what the following messages mean to
>>> understand the calculations:
>>>       
>> Not really, they are for information only.
>>
>>     
>>> warning: 7.65818e-10 backoff probability mass left for "" -- incrementing denominator
>>>       
>> This means your unigram probabilities, even after discounting, sum to (almost) 1.
>> As a crude fallback, the denominator in the estimator is incremented to yield
>> usable backoff probability mass.
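>>
>> (Schematically, not the actual SRILM code: if the discounted estimates use up
>> essentially all of the probability mass, enlarging the denominator by one
>> leaves a small but nonzero amount for backoff.)
>>
>>     def leftover_mass(counts, discount, denom):
>>         # probability mass not claimed by the discounted ML estimates
>>         used = sum(max(c - discount, 0.0) for c in counts) / denom
>>         return 1.0 - used
>>
>>     # if leftover_mass(counts, D, total) is ~0, recomputing with denom = total + 1
>>     # leaves roughly 1/(total + 1) of the mass for the backoff distribution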
>>
>>     
>>> warning: distributing 0.000254455 left-over probability mass over all 124 words
>>>       
>> Here the backoff mass is 0.000254455 and is spread out over the 124 words that
>> don't have any observed occurrences.
>>
>>     
>>> discarded 254764 7-gram probs discounted to zero
>>>       
>> Due to the discounting cutoffs (mincounts, see above), some 7-grams were not
>> included in the model.
>>
>>     
>>> inserted 2766 redundant 3-gram probs
>>>       
>> The ARPA format requires all prefixes of ngrams with probabilities to
>> also have probabilities.  E.g., if "a b c" is in the model, so must "a b",
>> even if "a b" was not in the input ngram counts.  In such cases SRILM will
>> insert the "a b" probability but make it equal to what the backoff computation
>> would yield.
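>>
>> (Illustrative only: the inserted probability is simply what the backoff
>> computation would produce anyway, e.g. in log10 terms something like this.)
>>
>>     def backed_off_logprob(logprobs, bows, ngram):
>>         # logprobs: n-gram tuple -> log10 probability (all unigrams present)
>>         # bows: context tuple -> log10 backoff weight, implicitly 0 if absent
>>         if ngram in logprobs or len(ngram) == 1:
>>             return logprobs[ngram]
>>         return bows.get(ngram[:-1], 0.0) + backed_off_logprob(logprobs, bows, ngram[1:])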
>>
>> Andreas
>>
>>
>>     