[SRILM User List] how are the probabilities computed in ngram-count

Saman Noorzadeh saman_2004 at yahoo.com
Wed Apr 11 05:48:54 PDT 2012


Thank you, 
-cdiscount 0 works perfectly, but now that I have read about smoothing and different methods of discounting I have another question:


I want to know your ideas about this problem:
I want to build a model from a text and then predict what the user is typing (a word prediction approach): at any moment I will predict the next character according to my bigrams.
Do you think discounting and smoothing methods are useful in treating the training data,
or is it more appropriate to just disable them?

Thank you
Saman
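
The prediction approach described above can be sketched in a few lines. This is only an illustration using plain Python bigram counts (not SRILM's own API); the training string and function names are made up for the example, and it uses the unsmoothed most-frequent-follower choice:

```python
from collections import Counter, defaultdict

def train_char_bigrams(text):
    """Count character bigrams in the training text."""
    counts = defaultdict(Counter)
    for a, b in zip(text, text[1:]):
        counts[a][b] += 1
    return counts

def predict_next_char(counts, prev_char):
    """Return the most likely next character given the previous one
    (unsmoothed maximum-likelihood choice)."""
    if prev_char not in counts:
        return None  # unseen context: this is where smoothing would help
    return counts[prev_char].most_common(1)[0][0]

counts = train_char_bigrams("the cook was there and the cook was here")
print(predict_next_char(counts, "t"))  # 'h' always follows 't' in this text
```

Smoothing matters here exactly when `prev_char` (or a follower) was never seen in training: without it, the model assigns zero probability and cannot rank candidates.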




________________________________
 From: Andreas Stolcke <stolcke at icsi.berkeley.edu>
To: Saman Noorzadeh <saman_2004 at yahoo.com> 
Cc: Srilm group <srilm-user at speech.sri.com> 
Sent: Wednesday, April 11, 2012 1:46 AM
Subject: Re: [SRILM User List] how are the probabilities computed in ngram-count
 

On 4/10/2012 1:29 AM, Saman Noorzadeh wrote: 
>Hello
>I am getting confused about the models that ngram-count makes:
>ngram-count -order 2  -write-vocab vocabulary.voc -text mytext.txt   -write model1.bo
>ngram-count -order 2  -read model1.bo -lm model2.BO
>
>
>for example (the text is very large and these words are just a sample):
>
>
>
>in model1.bo:
>cook   14 
>
>cook was 1
>
>
>in model2.BO:
>-1.904738  cook was 
>
>
>my question is: the probability of the 'cook was' bigram should be log10(1/14), but the ngram-count result shows log10(1/80) == -1.9047.
>How are these probabilities computed?
It's called "smoothing" or "discounting", and it ensures that ngrams
never seen in the training data receive nonzero probability.
Please consult any of the basic LM tutorial sources listed at
http://www.speech.sri.com/projects/srilm/manpages/, or specifically
http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html .

To obtain the unsmoothed probability estimates that you are expecting,
you need to change the parameters.  Try ngram-count -addsmooth 0 ....
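
For the numbers in the original example: the unsmoothed maximum-likelihood estimate is count(cook was)/count(cook) = 1/14, while discounting lowers it to roughly 1/80. A small sketch of the arithmetic (plain Python; add-k smoothing is shown only to illustrate how smoothing shifts probability mass, and the vocabulary size V below is hypothetical — SRILM's default Good-Turing numbers depend on count-of-count statistics not shown in the thread):

```python
import math

# Counts taken from the example in the thread
c_cook = 14      # unigram count of "cook"
c_cook_was = 1   # bigram count of "cook was"

# Unsmoothed maximum-likelihood estimate: what -addsmooth 0 produces
p_mle = c_cook_was / c_cook
print(round(math.log10(p_mle), 4))   # -1.1461

# Add-k smoothing (what -addsmooth k does): (c + k) / (n + k*V)
# V (vocabulary size) and k are hypothetical values for illustration.
V = 1000
k = 1
p_addk = (c_cook_was + k) / (c_cook + k * V)

# Smoothing lowers seen-bigram probabilities to reserve mass for unseen ones
print(p_addk < p_mle)   # True
```

This is why the model file shows -1.9047 rather than log10(1/14) = -1.1461: part of the probability mass has been redistributed to bigrams that never occurred in the training text.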

Andreas

