[SRILM User List] how are the probabilities computed in ngram-count
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Apr 10 16:46:54 PDT 2012
On 4/10/2012 1:29 AM, Saman Noorzadeh wrote:
> Hello
> I am getting confused about the models that ngram-count makes:
> ngram-count -order 2 -write-vocab vocabulary.voc -text mytext.txt
> -write model1.bo
> ngram-count -order 2 -read model1.bo -lm model2.BO
>
> for example (the text is very large and these words are just a sample):
>
> in model1.bo:
> cook 14
> cook was 1
>
> in model2.BO:
> -1.904738 cook was
>
> my question is: shouldn't the probability of the 'cook was' bigram be
> log10(1/14)? But the ngram-count result shows log10(1/80) == -1.9047.
> How are these probabilities computed?
It's called "smoothing" or "discounting": probability mass is taken away
from the ngrams observed in the training data and reserved for ngrams
never seen there, so that those receive nonzero probability. That is why
your bigram estimate comes out well below the maximum-likelihood value
of log10(1/14) = -1.146.
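For intuition, here is a minimal Python sketch of add-delta smoothing
(what the -addsmooth option implements); note that ngram-count's default
is Good-Turing discounting with Katz backoff, so these numbers are for
illustration only, and the vocabulary size V below is made up:

    import math

    def addsmooth_logprob(bigram_count, history_count, vocab_size, delta):
        """log10 P(w2 | w1) with delta added to every possible bigram count."""
        return math.log10((bigram_count + delta) /
                          (history_count + delta * vocab_size))

    # c(cook was) = 1 and c(cook) = 14, taken from your counts file;
    # V = 1000 is an assumed vocabulary size.
    print(addsmooth_logprob(1, 14, 1000, 0))  # -1.146 = log10(1/14), the unsmoothed MLE
    print(addsmooth_logprob(1, 14, 1000, 1))  # -2.705, mass has moved to unseen bigrams

With delta = 0 you get the plain relative frequency; any delta > 0 moves
probability mass to bigrams with count zero.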
Please consult any of the basic LM tutorial sources listed at
http://www.speech.sri.com/projects/srilm/manpages/, or specifically
http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html .
To obtain the unsmoothed probability estimates that you are expecting,
you need to change the smoothing parameters. Try ngram-count -addsmooth 0 ....
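For instance (with model_mle.BO as a made-up output name, and assuming
the same training text as above), something like

    ngram-count -order 2 -text mytext.txt -addsmooth 0 -lm model_mle.BO

should give you bigram entries that are plain relative frequencies,
i.e. log10(1/14) for "cook was".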
Andreas