[SRILM User List] Question about ngram-count
Andreas Stolcke
stolcke at speech.sri.com
Wed Aug 18 16:28:13 PDT 2010
wei chen wrote:
> Hi all,
> I trained an LM using the default discounting algorithm of
> ngram-count successfully, but in one experiment I removed some
> training data and increased the amount of other training data to
> keep the total training set size fixed. I then got the following
> messages:
>
> warning: discount coeff 1 is out of range: -0
> warning: discount coeff 1 is out of range: -2.09472
> warning: discount coeff 3 is out of range: 0.966989
> warning: discount coeff 5 is out of range: 0.990832
> warning: discount coeff 7 is out of range: 0.998723
> warning: discount coeff 1 is out of range: -4.55137
> warning: discount coeff 3 is out of range: 0.988902
> The training process also became very slow, and I do not know why.
The FAQ says:
>
> *C3) Why am I getting errors or warnings from the smoothing method I'm
> using?*
> The Good-Turing and Kneser-Ney smoothing methods rely on
> statistics called "counts-of-counts", the number of words occurring
> once, twice, three times, etc. The formulae for these methods
> become undefined if the counts-of-counts are zero, or not strictly
> decreasing. Some conditions are fatal (such as when the count of
> singleton words is zero), others lead to less smoothing (and
> warnings). To avoid these problems, check for the following
> possibilities:
>
> a)
> The data could be very sparse, i.e., the training corpus very
> small. Try using the Witten-Bell discounting method.
> b)
> The vocabulary could be very small, such as when training an
> LM based on characters or parts-of-speech. Smoothing is less
> of an issue in those cases, and the Witten-Bell method should
> work well.
> c)
> The data was manipulated in some way, or artificially
> generated. For example, duplicating data eliminates the
> odd-numbered counts-of-counts.
>
This is my guess as to what happened. Did you duplicate some of your
data? Even if the data is an artificial mix of several sources, you can
get counts-of-counts statistics that lead to errors in the GT discount
estimator.
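To see what goes wrong, here is a rough sketch (my own illustration in
Python, not SRILM's actual code) of the textbook Katz/Good-Turing
discount computation from counts-of-counts; the names gt_discounts, n,
and gtmax are made up for this example:

# Textbook Katz/Good-Turing discount coefficients computed from
# counts-of-counts; a sketch only, not SRILM's implementation.
def gt_discounts(n, gtmax=7):
    """n maps a count r to its count-of-counts n_r."""
    # Katz normalization term from the largest discounted count.
    common = (gtmax + 1) * n.get(gtmax + 1, 0) / n[1]
    coeffs = {}
    for r in range(1, gtmax + 1):
        # Good-Turing adjusted count: r* = (r+1) * n_{r+1} / n_r
        r_star = (r + 1) * n.get(r + 1, 0) / n[r]
        d = (r_star / r - common) / (1.0 - common)
        coeffs[r] = d
        if d <= 0 or d > 1:
            print("warning: discount coeff %d is out of range: %g" % (r, d))
    return coeffs

# Smoothly decreasing counts-of-counts give sensible coefficients:
gt_discounts({1: 1000, 2: 400, 3: 200, 4: 120, 5: 80, 6: 60, 7: 45, 8: 35})

# Duplicated or artificially mixed data distorts the odd
# counts-of-counts, and the coefficients go out of range:
gt_discounts({1: 10, 2: 1000, 3: 5, 4: 400, 5: 2, 6: 200, 7: 1, 8: 120})

When the counts-of-counts stop decreasing smoothly, the normalization
term can exceed 1, the denominator flips sign, and the coefficients come
out negative, which is essentially what your warnings are reporting.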
>
> d)
> The vocabulary is limited during counts collection using the
> *ngram-count* *-vocab* option, with the effect that many
> low-frequency N-grams are eliminated. The proper approach is
> to compute smoothing parameters on the full vocabulary. This
> happens automatically in the *make-big-lm* wrapper script,
> which is preferable to direct use of *ngram-count* for other
> reasons (see issue B3-a above).
> e)
> You are estimating an LM from N-gram counts that have been
> truncated beforehand, e.g., by removing singleton events. If
> you cannot go back to the original data and recompute the
> counts, there is a heuristic to extrapolate low
> counts-of-counts from higher ones. The heuristic is invoked
> automatically (and an informational message is output) when
> *make-big-lm* is used to estimate LMs with Kneser-Ney
> smoothing. For details see the paper by W. Wang et al. in
> ASRU-2007, listed under "SEE ALSO".
>
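Regarding (d) and (e) above: if you are working from count files, a
make-big-lm invocation along these lines (file names and the -name
argument are placeholders; see the training-scripts man page for your
version) computes the discounting parameters on the full vocabulary and
applies the count-of-counts extrapolation automatically for Kneser-Ney:

make-big-lm -name biglm -read counts.gz -order 3 -kndiscount -lm big.lm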
If you cannot fix the problem, try using a different smoothing method,
like Witten-Bell.
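For example, something along these lines (file names are placeholders;
check the ngram-count man page for the exact options in your version):

ngram-count -text train.txt -order 3 -wbdiscount -lm wb.lm

Witten-Bell discounting is based on the number of distinct words
observed after each context, not on counts-of-counts, so it is not
affected by this problem.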
Andreas