[SRILM User List] Question on unsmoothed estimates

Andreas Stolcke stolcke at icsi.berkeley.edu
Mon Jan 28 13:47:34 PST 2013

On 1/28/2013 1:38 PM, Avneesh Saluja wrote:
> Hello Andreas,
> I hope you're doing well.  I have a quick question on SRILM and its 
> ability to compute completely unsmoothed probability estimates.  Of 
> course, I can use the counts output of ngram-count and then compute 
> probabilities from there, but since ngram-count already does this, I 
> thought I should use that facility, but I'm not able to get it to do 
> what I want.
> Here's an example, with a small LM consisting of a training corpus of 
> only 12,534 words (using "wc" on the file).  There are 1872 unigrams 
> (as per the LM output).  The exact command I used to generate my LM is:
>  ~/tools/srilm/bin/i686-m64/ngram-count -order 3 -text 
> ../data/lm_training/small/train.txt -cdiscount 0 -lm unsmoothed-lm
> First, I see that the word "accident" occurs 5 times in my corpus. 
>  Therefore, one would expect the unigram probability to be 
> log10(5/12534) = -3.40.  However, the result in SRILM is -3.45, 
> indicating some sort of smoothing going on.
The end-of-sentence tokens also count as events in the model.  So your 
denominator is larger than you assume, hence the lower probability estimate.

If you use ngram-count -debug 4 you will see exactly what quantities go 
into the estimation of each ngram probability.
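
As a rough illustration of the effect (a toy sketch, not SRILM code), the snippet below computes an unsmoothed unigram estimate in which every sentence contributes one </s> event to the denominator. The function name and the two-sentence corpus are made up for the example, and it assumes that <s> is not counted as an event.

```python
import math
from collections import Counter

def unigram_log10_mle(sentences, word):
    """Unsmoothed log10 unigram probability, counting one </s> per sentence.

    Assumption for this sketch: <s> is not an event, </s> is.
    """
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
        counts["</s>"] += 1  # the end-of-sentence token is an event too
    total = sum(counts.values())
    return math.log10(counts[word] / total)

# Hypothetical toy corpus: "wc" would report 7 words, but the model sees 9 events.
toy = ["the accident was minor", "no accident today"]
with_eos = unigram_log10_mle(toy, "accident")  # log10(2/9)
without_eos = math.log10(2 / 7)                # what a "wc"-based estimate gives
print(with_eos, without_eos)
```

The estimate that includes </s> is always the lower of the two, which is the direction of the discrepancy reported above.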

> Furthermore, when looking at higher order n-grams, I see that there 
> are only 2 trigrams where the first two words are "hilton hotel" --> 
> "hilton hotel ?" and "hilton hotel ,", the count of the former is 2 
> and the count of the latter is 1.  However, in the resulting 
> unsmoothed n-gram, I only see the former entry "hilton hotel ?", and 
> it has the right log probability (-0.176 --> 10^(-0.176) = 0.67), but 
> I can't find the entry "hilton hotel ,", which should have a log 
> probability of log10(1/3) = -0.477.  However, for another instance, 
> say the bigrams w_1, w_2 where w_1 = "twelve", I get the correct 
> probability estimates for the bigrams.
By default, trigrams (and 4-grams, etc.) that occur only once are omitted 
from the LM.  Use -gt3min 1 to lower the minimum trigram count to 1 so 
that they are kept.
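
The behaviour described in the question can be mimicked with the small sketch below. It is only an illustration of the observed result, not SRILM's implementation: the function name and the min_count parameter are hypothetical stand-ins for the minimum-count setting. The denominator is taken over all observed trigrams with the same two-word context, which matches the log probability of -0.176 reported for "hilton hotel ?".

```python
import math
from collections import Counter

def trigram_log10_probs(trigram_counts, min_count=2):
    """Unsmoothed log10 trigram probabilities, omitting rare trigrams.

    The context total uses every observed trigram; min_count only
    decides which entries are written out.
    """
    context_totals = Counter()
    for (w1, w2, _w3), c in trigram_counts.items():
        context_totals[(w1, w2)] += c
    return {
        tri: math.log10(c / context_totals[tri[:2]])
        for tri, c in trigram_counts.items()
        if c >= min_count
    }

# Counts taken from the question above.
counts = {("hilton", "hotel", "?"): 2, ("hilton", "hotel", ","): 1}
default_lm = trigram_log10_probs(counts)            # singleton omitted
full_lm = trigram_log10_probs(counts, min_count=1)  # singleton kept
print(default_lm)
print(full_lm)
```

With min_count=2 only "hilton hotel ?" is listed; with min_count=1 both entries appear, at log10(2/3) and log10(1/3).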

