GT discounting and backoff

Andreas Stolcke stolcke at speech.sri.com
Sun Dec 28 15:06:51 PST 2003


Roy,

In message <000901c3c7c9$4dedf460$34284484 at cs.technion.ac.il> you wrote:
> Hi,
> 
> I have a few questions about the implementation of GT-discounting and
> Katz backoff in ngram-count.
> 
> 1. What is the default value of gtNmin and gtNmax in ngram-count?

They differ for different N.  Run ngram-count -help to see all the
default parameter values.
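
For example, you can check the defaults and then override the order-2
cutoffs explicitly on the command line (train.txt and model.lm below are
placeholder names, and the -gt2min/-gt2max values are just illustrative,
not necessarily the built-in defaults):

    ngram-count -help
    ngram-count -order 3 -text train.txt -gt2min 1 -gt2max 7 -lm model.lm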

> 
> 2. Is backing off done only for ngrams that don't appear in the language
> model at all, or for ngrams that appear less than k>0 times (and what is
> this k). If I want backing off to be done only for counts below some k,
> should I set gtNmin to that value?

Exactly.  However, for all N-grams that are in the *language model*, the
corresponding conditional N-gram probability is always used.  So the
cutoffs refer not to the LM itself, but to the counts in the *training
data*.
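
For instance, to exclude trigrams seen fewer than 3 times in the training
data from the model, so that their probabilities are obtained by backing
off to the bigram estimates, something like the following should work
(train.txt and model.lm are placeholder names):

    ngram-count -order 3 -text train.txt -gt3min 3 -lm model.lm
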
> 
> 3. What does the following warning mean:
> 
> warning: discount coeff 4 is out of range
> 
> Does it mean that the discount for ngrams that appear only 4 times is
> very small? Why is it a warning?

The warning indicates that the GT discount formula yields a value outside
the range 0...1 and therefore cannot be used.  This happens when your
counts-of-counts (how many singletons, 2-counts, 3-counts, etc. there are)
are not smoothly distributed, usually as a result of insufficient data or
some artificial manipulation of the data (e.g., duplicating some portion
of it).  ngram-count simply disables discounting for those N-grams.
If you get this warning a lot, you can try some of the other smoothing
methods.  Witten-Bell, for example, is very robust to the kinds of
problems that cause GT to fail.
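
In case it helps to see where the warning comes from: in its simplest
form, the Good-Turing discount coefficient for a count c is

    d_c = (c+1) * n_{c+1} / (c * n_c)

where n_c is the number of distinct N-grams occurring exactly c times.
(SRILM uses Katz's variant, which also renormalizes with respect to the
gtmax cutoff, but the failure mode is the same.)  The message above means
that d_4 fell outside 0...1; with the simple formula that happens, for
example, whenever 5 * n_5 > 4 * n_4.  To switch to Witten-Bell smoothing
instead, something like the following should work (train.txt and model.lm
are placeholder names):

    ngram-count -order 3 -text train.txt -wbdiscount -lm model.lm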

--Andreas 
