GT discounting and backoff
Andreas Stolcke
stolcke at speech.sri.com
Sun Dec 28 15:06:51 PST 2003
Roy,
In message <000901c3c7c9$4dedf460$34284484 at cs.technion.ac.il> you wrote:
> Hi,
>
> I have a few questions about the implementation of GT-discounting and
> Katz backoff in ngram-count.
>
> 1. What is the default value of gtNmin and gtNmax in ngram-count?
The defaults differ for different N-gram orders. Run ngram-count -help to
see all the default parameter values.
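For example (the per-order option names are gt1min/gt1max, gt2min/gt2max,
and so on):

    # list all options together with their current default values
    ngram-count -help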
>
> 2. Is backing off done only for ngrams that don't appear in the language
> model at all, or for ngrams that appear less than k>0 times (and what is
> this k). If I want backing off to be done only for counts below some k,
> should I set gtNmin to that value?
Exactly. Note, however, that for any N-gram that is actually *in the
language model*, its explicit conditional probability is always used;
backing off happens only for N-grams missing from the LM. So the cutoffs
refer not to the LM itself, but to the counts in the *training data*:
N-grams occurring fewer than gtNmin times there are left out of the model
and handled by backoff.
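For example, to build a trigram backoff model in which bigrams seen fewer
than 2 times and trigrams seen fewer than 3 times in the training data are
excluded and handled by backoff (file names here are just placeholders):

    ngram-count -text train.txt -order 3 -gt2min 2 -gt3min 3 -lm katz.3bo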
>
> 3. What does the following warning mean:
>
> warning: discount coeff 4 is out of range
>
> Does it mean that the discount for ngrams that appear only 4 times is
> very small? Why is it a warning?
The warning indicates that the GT discount formula yields a value outside
the range 0..1 and therefore cannot be used. This happens when your
counts-of-counts (how many singletons, 2-counts, 3-counts, etc. there are)
are not smoothly distributed, usually as a result of insufficient data or
some artificial manipulation of the data (e.g., duplicating some portion
of it). ngram-count simply disables discounting for those N-grams.
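For reference, the quantity being checked is the standard Katz/Good-Turing
discount coefficient (the textbook formula, written here in LaTeX notation;
a sketch of the math, not a quote of the SRILM source). With n_r the number
of N-grams occurring exactly r times and k the gtNmax cutoff:

    d_r = \frac{\frac{(r+1) n_{r+1}}{r\, n_r} - \frac{(k+1) n_{k+1}}{n_1}}
               {1 - \frac{(k+1) n_{k+1}}{n_1}}

A quick worked case: if your data happened to contain n_4 = n_5 = 100, then
(r+1) n_{r+1} / (r n_r) = 500/400 = 1.25 for r = 4, which forces d_4 above 1
and triggers exactly the "discount coeff 4 is out of range" warning above.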
If you get this a lot, you can try one of the other smoothing methods.
Witten-Bell, for example, is very robust to the kinds of problems that
cause GT to fail.
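For instance, to switch the whole model to Witten-Bell smoothing (again
with placeholder file names):

    ngram-count -text train.txt -order 3 -wbdiscount -lm wb.3bo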
--Andreas