[SRILM User List] (no subject)
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Sep 4 00:46:32 PDT 2012
On 9/2/2012 10:10 PM, hic et nunc wrote:
> hello again. i have a new question about lm ngram probs.
> as you know well, in lm file, the log probs are calculated like this:
> log [ count[n-gram]*d / (count[(n-1)-gram] - count[(n-1)-gram_<unk>]) ]
> sometimes 1 is added to denominator, but sometimes not. what is the
> reason of this?
One is added to the denominator only as a last resort, when the smoothing
results in explicit n-gram probabilities that sum to 1, leaving no
probability mass for backoff.
The following comment in NgramLM.cc explains why:
> /*
> * This is a hack credited to Doug Paul (by Roni Rosenfeld in
> * his CMU tools). It may happen that no probability mass
> * is left after totalling all the explicit probs, typically
> * because the discount coefficients were out of range and
> * forced to 1.0. Unless we have seen all vocabulary words in
> * this context, to arrive at some non-zero backoff mass,
> * we try incrementing the denominator in the estimator by 1.
> * Another hack: If the discounting method uses interpolation
> * we first try disabling that because interpolation removes
> * probability mass.
> */
This happens occasionally with Good-Turing (GT) smoothing due to degenerate
count-of-counts statistics.
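
To make the denominator hack concrete, here is a minimal Python sketch of the idea described in the comment above. It is not the SRILM code: the function name, its arguments, and the tolerance `eps` are made up for illustration, and the step that first disables interpolation is left out. It assumes the discount coefficients are already clamped to at most 1.0.

```python
def backoff_mass_with_denominator_hack(counts, discounts, vocab_size, eps=1e-12):
    """Illustrative sketch only (hypothetical helper, not SRILM's API).

    counts:     dict word -> count of the n-gram (context + word)
    discounts:  dict word -> discount coefficient d, assumed 0 < d <= 1
    vocab_size: number of words in the vocabulary

    Returns (probs, backoff_mass, denominator).
    """
    # Start from the usual denominator: the total count of the context.
    denominator = sum(counts.values())
    while True:
        # Discounted relative-frequency estimate for each seen word.
        probs = {w: discounts[w] * c / denominator for w, c in counts.items()}
        left = 1.0 - sum(probs.values())
        # Stop if some mass is left for backoff, or if every vocabulary
        # word was seen in this context (no backoff mass is needed then).
        if left > eps or len(counts) >= vocab_size:
            return probs, max(left, 0.0), denominator
        # Last resort: increment the denominator by 1 and try again.
        denominator += 1
```

For example, with counts `{"a": 3, "b": 1}`, all discounts forced to 1.0, and a 10-word vocabulary, the first pass leaves no backoff mass, so the sketch retries with a denominator of 5 instead of 4.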
Andreas