[SRILM User List] Question about ngram-count
Andreas Stolcke
stolcke at speech.sri.com
Wed Aug 18 16:28:13 PDT 2010
wei chen wrote:
> Hi all,
> I trained an LM using the default discounting algorithm of
> ngram-count successfully, but in one experiment I removed some
> training data and increased the amount of other training data to
> keep the total training set size fixed. I then got the following
> messages:
>
> warning: discount coeff 1 is out of range: -0
> warning: discount coeff 1 is out of range: -2.09472
> warning: discount coeff 3 is out of range: 0.966989
> warning: discount coeff 5 is out of range: 0.990832
> warning: discount coeff 7 is out of range: 0.998723
> warning: discount coeff 1 is out of range: -4.55137
> warning: discount coeff 3 is out of range: 0.988902
> The training process also became very slow, and I do not know why.
The FAQ says:
>
> *C3) Why am I getting errors or warnings from the smoothing method I'm
> using?*
> The Good-Turing and Kneser-Ney smoothing methods rely on
> statistics called "counts-of-counts", the number of words occurring
> once, twice, three times, etc. The formulae for these methods
> become undefined if the counts-of-counts are zero, or not strictly
> decreasing. Some conditions are fatal (such as when the count of
> singleton words is zero), others lead to less smoothing (and
> warnings). To avoid these problems, check for the following
> possibilities:
>
> a)
> The data could be very sparse, i.e., the training corpus very
> small. Try using the Witten-Bell discounting method.
> b)
> The vocabulary could be very small, such as when training an
> LM based on characters or parts-of-speech. Smoothing is less
> of an issue in those cases, and the Witten-Bell method should
> work well.
> c)
> The data was manipulated in some way, or artificially
> generated. For example, duplicating data eliminates the
> odd-numbered counts-of-counts.
>
This is my guess as to what happened. Did you duplicate some of your
data? Even if the data is an artificial mix of several sources, you can
get counts-of-counts statistics that lead to errors in the GT discount
estimator.
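To see what goes wrong, here is a rough sketch (my own illustration in
Python, not SRILM's actual code) of the textbook Katz/Good-Turing
discount computation from counts-of-counts; the names gt_discounts, n,
and gtmax are made up for this example:

# Textbook Katz/Good-Turing discount coefficients computed from
# counts-of-counts; a sketch only, not SRILM's implementation.
def gt_discounts(n, gtmax=7):
    """n maps a count r to its count-of-counts n_r."""
    # Katz normalization term from the largest discounted count.
    common = (gtmax + 1) * n.get(gtmax + 1, 0) / n[1]
    coeffs = {}
    for r in range(1, gtmax + 1):
        # Good-Turing adjusted count: r* = (r+1) * n_{r+1} / n_r
        r_star = (r + 1) * n.get(r + 1, 0) / n[r]
        d = (r_star / r - common) / (1.0 - common)
        coeffs[r] = d
        if d <= 0 or d > 1:
            print("warning: discount coeff %d is out of range: %g" % (r, d))
    return coeffs

# Smoothly decreasing counts-of-counts give sensible coefficients:
gt_discounts({1: 1000, 2: 400, 3: 200, 4: 120, 5: 80, 6: 60, 7: 45, 8: 35})

# Duplicated or artificially mixed data distorts the odd
# counts-of-counts, and the coefficients go out of range:
gt_discounts({1: 10, 2: 1000, 3: 5, 4: 400, 5: 2, 6: 200, 7: 1, 8: 120})

When the counts-of-counts stop decreasing smoothly, the normalization
term can exceed 1, the denominator flips sign, and the coefficients come
out negative, which is essentially what your warnings are reporting.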
>
> d)
> The vocabulary is limited during counts collection using the
> *ngram-count* *-vocab* option, with the effect that many
> low-frequency N-grams are eliminated. The proper approach is
> to compute smoothing parameters on the full vocabulary. This
> happens automatically in the *make-big-lm* wrapper script,
> which is preferable to direct use of *ngram-count* for other
> reasons (see issue B3-a above).
> e)
> You are estimating an LM from N-gram counts that have been
> truncated beforehand, e.g., by removing singleton events. If
> you cannot go back to the original data and recompute the
> counts, there is a heuristic to extrapolate low
> counts-of-counts from higher ones. The heuristic is invoked
> automatically (and an informational message is output) when
> *make-big-lm* is used to estimate LMs with Kneser-Ney
> smoothing. For details see the paper by W. Wang et al. in
> ASRU-2007, listed under "SEE ALSO".
>
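Regarding (d) and (e) above: if you are working from count files, a
make-big-lm invocation along these lines (file names and the -name
argument are placeholders; see the training-scripts man page for your
version) computes the discounting parameters on the full vocabulary and
applies the count-of-counts extrapolation automatically for Kneser-Ney:

make-big-lm -name biglm -read counts.gz -order 3 -kndiscount -lm big.lm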
If you cannot fix the problem, try using a different smoothing method,
like Witten-Bell.
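For example, something along these lines (file names are placeholders;
check the ngram-count man page for the exact options in your version):

ngram-count -text train.txt -order 3 -wbdiscount -lm wb.lm

Witten-Bell discounting is based on the number of distinct words
observed after each context, not on counts-of-counts, so it is not
affected by this problem.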
Andreas