[SRILM User List] Question on unsmoothed estimates
stolcke at icsi.berkeley.edu
Mon Jan 28 13:47:34 PST 2013
On 1/28/2013 1:38 PM, Avneesh Saluja wrote:
> Hello Andreas,
> I hope you're doing well. I have a quick question on SRILM and its
> ability to compute completely unsmoothed probability estimates. Of
> course, I can use the counts output of ngram-count and then compute
> probabilities from there, but since ngram-count already does this, I
> thought I should use that facility, but I'm not able to get it to do
> what I want.
> Here's an example, with a small LM consisting of a training corpus of
> only 12,534 words (using "wc" on the file). There are 1872 unigrams
> (as per the LM output). The exact command I used to generate my LM is:
> ~/tools/srilm/bin/i686-m64/ngram-count -order 3 -text
> ../data/lm_training/small/train.txt -cdiscount 0 -lm unsmoothed-lm
> First, I see that the word "accident" occurs 5 times in my corpus.
> Therefore, one would expect the unigram probability to be
> log10(5/12534) = -3.40. However, the result in SRILM is -3.45,
> indicating some sort of smoothing going on.
The end-of-sentence tokens (</s>) also count as events in the model, one
per sentence. So your denominator is larger than the word count you
assume, hence the lower probability estimate.
If you use ngram-count -debug 4 you will see exactly what quantities go
into the estimation of each ngram probability.
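
To make the arithmetic concrete, here is a rough Python sketch of the unsmoothed unigram estimate with the end-of-sentence events added to the denominator. It is not SRILM code. The function name is made up for illustration, and the sentence count of 1500 is a hypothetical value (the number of sentences in the corpus is not given above), chosen only to show that a count of that size is enough to account for the difference.

```python
import math

def unsmoothed_unigram_logprob(word_count, num_word_tokens, num_sentences):
    """log10 of the maximum-likelihood unigram estimate.

    Assumes one end-of-sentence event per sentence is added to the
    denominator, as described in the reply above.
    """
    total_events = num_word_tokens + num_sentences
    return math.log10(word_count / total_events)

# Figures from the question: "accident" occurs 5 times in 12,534 word tokens.
naive = math.log10(5 / 12534)

# HYPOTHETICAL sentence count, not taken from the original corpus.
assumed_sentences = 1500
with_eos = unsmoothed_unigram_logprob(5, 12534, assumed_sentences)

print(f"words only:         {naive:.2f}")
print(f"words plus </s>:    {with_eos:.2f}")
```

With the real sentence count of the training file substituted for the assumed one, the second figure should be comparable to the value in the LM file.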
> Furthermore, when looking at higher order n-grams, I see that there
> are only 2 trigrams where the first two words are "hilton hotel" -->
> "hilton hotel ?" and "hilton hotel ,", the count of the former is 2
> and the count of the latter is 1. However, in the resulting
> unsmoothed n-gram, I only see the former entry "hilton hotel ?", and
> it has the right log probability (-0.176 --> 10^(-0.176) = 0.67), but
> I can't find the entry "hilton hotel ,", which should have a log
> probability of log10(1/3) = -0.477. However, for another instance,
> say the bigrams w_1, w_2 where w_1 = "twelve", I get the correct
> probability estimates for the bigrams.
By default, trigrams (and 4-grams, etc.) that occur only once are omitted
from the LM. Use -gt3min 1 to change that.
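
As a simplified illustration of the effect (not SRILM's actual implementation), the Python sketch below computes unsmoothed trigram estimates from the two counts in the question and drops entries below a minimum count. The function and its min_count parameter are hypothetical names standing in for the cutoff; the sketch assumes the denominator is taken over all observed trigrams for the context, which matches the 0.67 reported for the entry that was kept.

```python
import math
from collections import Counter

def unsmoothed_trigram_logprobs(trigram_counts, min_count=2):
    """log10 relative-frequency estimates for trigrams.

    Trigrams with a count below min_count are left out of the result,
    but the denominators still use the full counts for each context.
    """
    context_totals = Counter()
    for (w1, w2, _w3), count in trigram_counts.items():
        context_totals[(w1, w2)] += count
    return {
        trigram: math.log10(count / context_totals[trigram[:2]])
        for trigram, count in trigram_counts.items()
        if count >= min_count
    }

# Counts from the question.
counts = {("hilton", "hotel", "?"): 2, ("hilton", "hotel", ","): 1}

default_lm = unsmoothed_trigram_logprobs(counts)            # singleton dropped
full_lm = unsmoothed_trigram_logprobs(counts, min_count=1)  # singleton kept

for trigram, logprob in full_lm.items():
    print(" ".join(trigram), f"{logprob:.3f}")
```

With the cutoff at 2 only "hilton hotel ?" survives; with the cutoff at 1 both entries appear, which is what adding -gt3min 1 to the original ngram-count command is meant to achieve.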