[SRILM User List] (no subject)

Tue Sep 4 00:46:32 PDT 2012

On 9/2/2012 10:10 PM, hic et nunc wrote:
> hello again. i have a new question about lm ngram probs.
> as you know well, in lm file, the log probs are calculated like this: 
> log [(count[n-gram]*d/count[(n-1)-gram] - count[(n-1)-gram_<unk>]]
> sometimes 1 is added to denominator, but sometimes not. what is the 
> reason of this?
One is added to the denominator only as a last resort, when smoothing 
leaves the explicit n-gram probabilities in a context summing to 1, so 
that no probability mass remains for backoff.
The following comment in NgramLM.cc explains why:

>             /*
>              * This is a hack credited to Doug Paul (by Roni Rosenfeld in
>              * his CMU tools).  It may happen that no probability mass
>              * is left after totalling all the explicit probs, typically
>              * because the discount coefficients were out of range and
>              * forced to 1.0.  Unless we have seen all vocabulary words in
>              * this context, to arrive at some non-zero backoff mass,
>              * we try incrementing the denominator in the estimator by 1.
>              * Another hack: If the discounting method uses interpolation
>              * we first try disabling that because interpolation removes
>              * probability mass.
>              */

This happens occasionally with Good-Turing (GT) smoothing due to 
degenerate count-of-counts statistics.
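
To make the fallback concrete, here is a minimal Python sketch of the idea 
described in the comment. It is not the SRILM code: the function name 
`context_probs`, its arguments, and the tolerance `eps` are made up for 
illustration, and the interpolation step mentioned in the comment is left out.

```python
def context_probs(counts, discounts, vocab_size, eps=1e-12):
    """Illustrative only: discounted estimates for one context.

    counts     -- dict mapping word -> count of (context, word)
    discounts  -- dict mapping word -> discount coefficient d (default 1.0)
    vocab_size -- number of words in the vocabulary (hypothetical parameter)
    Returns (probs, backoff_mass, denominator_used).
    """
    total = sum(counts.values())

    def estimate(denom):
        # p(word | context) = d * count / denominator
        return {w: discounts.get(w, 1.0) * c / denom for w, c in counts.items()}

    denom = total
    probs = estimate(denom)

    # If the explicit probabilities use up all the mass, and some vocabulary
    # words were never seen in this context, add 1 to the denominator so
    # that a non-zero backoff mass is left over.
    if sum(probs.values()) > 1.0 - eps and len(counts) < vocab_size:
        denom = total + 1
        probs = estimate(denom)

    backoff_mass = 1.0 - sum(probs.values())
    return probs, backoff_mass, denom
```

With counts {"a": 3, "b": 1}, all discounts forced to 1.0, and a 5-word 
vocabulary, the unmodified estimates are 3/4 and 1/4 and leave nothing for 
backoff, so the sketch switches to a denominator of 5. If the discounts are 
below 1, mass is already left over and the denominator stays unchanged.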

Andreas
