[SRILM User List] how are the probabilities computed in ngram-count
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Apr 10 16:46:54 PDT 2012
On 4/10/2012 1:29 AM, Saman Noorzadeh wrote:
> Hello
> I am getting confused about the models that ngram-count makes:
> ngram-count -order 2 -write-vocab vocabulary.voc -text mytext.txt
> -write model1.bo
> ngram-count -order 2 -read model1.bo -lm model2.BO
>
> for example (the text is very large and these words are just a sample):
>
> in model1.bo:
> cook 14
> cook was 1
>
> in model2.BO:
> -1.904738 cook was
>
> my question is: shouldn't the probability of the 'cook was' bigram be
> log10(1/14)? But the ngram-count result shows log10(1/80) == -1.9047.
> How are these probabilities computed?
It's called "smoothing" or "discounting": probability mass is taken away
from the ngrams observed in the training data and reserved for ngrams
never seen there, so that those receive nonzero probability. That is why
your bigram estimate comes out well below the maximum-likelihood value
of log10(1/14) = -1.146.
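For intuition, here is a minimal Python sketch of add-delta smoothing
(what the -addsmooth option implements); note that ngram-count's default
is Good-Turing discounting with Katz backoff, so these numbers are for
illustration only, and the vocabulary size V below is made up:

    import math

    def addsmooth_logprob(bigram_count, history_count, vocab_size, delta):
        """log10 P(w2 | w1) with delta added to every possible bigram count."""
        return math.log10((bigram_count + delta) /
                          (history_count + delta * vocab_size))

    # c(cook was) = 1 and c(cook) = 14, taken from your counts file;
    # V = 1000 is an assumed vocabulary size.
    print(addsmooth_logprob(1, 14, 1000, 0))  # -1.146 = log10(1/14), the unsmoothed MLE
    print(addsmooth_logprob(1, 14, 1000, 1))  # -2.705, mass has moved to unseen bigrams

With delta = 0 you get the plain relative frequency; any delta > 0 moves
probability mass to bigrams with count zero.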
Please consult any of the basic LM tutorial sources listed at
http://www.speech.sri.com/projects/srilm/manpages/, or specifically
http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html .
To obtain the unsmoothed probability estimates that you are expecting,
you need to change the smoothing parameters. Try ngram-count -addsmooth 0 ....
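For instance (with model_mle.BO as a made-up output name, and assuming
the same training text as above), something like

    ngram-count -order 2 -text mytext.txt -addsmooth 0 -lm model_mle.BO

should give you bigram entries that are plain relative frequencies,
i.e. log10(1/14) for "cook was".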
Andreas