[SRILM User List] how are the probabilities computed in ngram-count
Saman Noorzadeh
saman_2004 at yahoo.com
Wed Apr 11 05:48:54 PDT 2012
Thank you,
-cdiscount 0 works perfectly. But now that I have read about smoothing and the different discounting methods, I have another question, and I would like your opinion on it:
I want to build a model from a text and then predict what the user is typing (a word prediction approach). At any moment I will predict what the next character would be according to my bigrams.
Do you think discounting and smoothing methods are useful in treating the training data,
or is it more appropriate to just disable them?
Thank you
Saman
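The bigram-based prediction described above can be sketched in plain Python (hypothetical toy corpus, not SRILM itself). Note that for ranking the *seen* continuations of a word, smoothing does not change the argmax; it mainly matters for assigning probability to unseen continuations:

```python
from collections import defaultdict

# Hypothetical training text; in practice the counts would come from
# an ngram-count run over a real corpus.
corpus = "the cook was here and the cook was late and the cook left".split()

# Count bigrams: bigram[w1][w2] = number of times w2 followed w1.
bigram = defaultdict(lambda: defaultdict(int))
for w1, w2 in zip(corpus, corpus[1:]):
    bigram[w1][w2] += 1

def predict_next(word):
    """Return the most frequent continuation of `word`, or None if unseen."""
    followers = bigram.get(word)
    if not followers:
        return None
    return max(followers, key=followers.get)

print(predict_next("cook"))  # "was" (follows "cook" twice, "left" once)
```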
________________________________
From: Andreas Stolcke <stolcke at icsi.berkeley.edu>
To: Saman Noorzadeh <saman_2004 at yahoo.com>
Cc: Srilm group <srilm-user at speech.sri.com>
Sent: Wednesday, April 11, 2012 1:46 AM
Subject: Re: [SRILM User List] how are the probabilities computed in ngram-count
On 4/10/2012 1:29 AM, Saman Noorzadeh wrote:
Hello
>I am getting confused about the models that ngram-count makes:
>
>ngram-count -order 2 -write-vocab vocabulary.voc -text mytext.txt -write model1.bo
>ngram-count -order 2 -read model1.bo -lm model2.BO
>
>For example (the text is very large; these words are just a sample):
>
>in model1.bo:
>cook 14
>cook was 1
>
>in model2.BO:
>-1.904738 cook was
>
>My question is that the probability of the 'cook was' bigram should be log10(1/14), but the ngram-count result shows log10(1/80) == -1.9047.
>How are these probabilities computed?
It's called "smoothing" or "discounting", and it ensures that ngrams
never seen in the training data still receive nonzero probability.
Please consult any of the basic LM tutorial sources listed at
http://www.speech.sri.com/projects/srilm/manpages/, or specifically
http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html.
To obtain the unsmoothed probability estimates that you are
expecting, you need to change the parameters. Try ngram-count
-addsmooth 0 ....
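Add-k smoothing estimates the conditional probability as (c(w1 w2) + k) / (c(w1) + k*V), where V is the vocabulary size, so setting k to 0 recovers the plain maximum-likelihood estimate. A sketch with the counts from the example above (the vocabulary size is made up for illustration):

```python
def addk_prob(bigram_count, context_count, vocab_size, k):
    """Add-k estimate: P(w2 | w1) = (c(w1 w2) + k) / (c(w1) + k*V)."""
    return (bigram_count + k) / (context_count + k * vocab_size)

# Counts from the thread; the vocabulary size is hypothetical.
c_bigram, c_context, V = 1, 14, 10000

p_unsmoothed = addk_prob(c_bigram, c_context, V, k=0)  # exactly 1/14
p_addone = addk_prob(c_bigram, c_context, V, k=1)      # 2/10014, much smaller
```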
Andreas