Q: probabilities calculation

Andreas Stolcke stolcke at speech.sri.com
Sun Oct 27 10:15:16 PST 2002


Bing,

Words with zero unigram counts can still get a non-zero probability as a
result of probability smoothing. The discounting method applied to
unigrams causes the total probability mass of the observed unigrams to
be less than one. SRILM then effectively implements a backoff to a
"zero-gram" (uniform) distribution. Since the DARPA format has no
provision for such a backoff, this is done implicitly:
If there is at least one word with zero counts (sometimes called a
"zeroton"), then the left-over unigram probability mass is distributed
evenly over all zeroton words. If all words in the vocabulary had
non-zero counts (i.e., no zerotons), then the left-over probability is
split evenly among all words and added to the previously estimated
unigram probabilities.
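
The two cases above can be sketched in a few lines of Python. This is only
an illustration of the rule as described here, not the actual SRILM code:
the function name distribute_leftover and its arguments are made up for
the example, and it works on plain probabilities rather than the log10
values stored in the LM file.

```python
import math


def distribute_leftover(unigram_probs, vocab):
    """Spread left-over unigram mass as described in the text (illustrative only).

    unigram_probs: dict mapping each observed word to its discounted
                   probability (keys are assumed to be a subset of vocab).
    vocab:         list of all words in the vocabulary.
    """
    leftover = 1.0 - sum(unigram_probs.values())
    result = dict(unigram_probs)
    if leftover <= 0.0:
        # No discounting was applied, so there is nothing to redistribute.
        return result

    zerotons = [w for w in vocab if w not in unigram_probs]
    if zerotons:
        # Case 1: split the left-over mass evenly over the zeroton words.
        share = leftover / len(zerotons)
        for w in zerotons:
            result[w] = share
    else:
        # Case 2: no zerotons, so add an equal share to every word.
        share = leftover / len(vocab)
        for w in vocab:
            result[w] += share
    return result


if __name__ == "__main__":
    probs = distribute_leftover({"a": 0.5, "b": 0.3}, ["a", "b", "c", "d"])
    for word, p in sorted(probs.items()):
        print(word, round(math.log10(p), 6))
```

Note that every zeroton ends up with the same probability, which is why all
unseen dictionary words show one identical value in a given LM.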

This is all implemented in Ngram::distributeProb(), which in turn is
invoked as part of the backoff weight normalization step.

So the short answer is that, depending on the discounting method chosen
for unigrams, zerotons get some non-zero probability via backoff to a
uniform distribution. If you want to disable that, you just need to
disable unigram discounting (-gt1max 0).
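
As a rough sanity check on the two values in the quoted question, one can
work backwards from a zeroton log probability to the left-over mass it
implies. The sketch below assumes the dictionary has exactly 46,000 words
(the question only says "46K"), that every training word is in the
dictionary, and that special tokens can be ignored, so the results are
only approximate. The helper name implied_leftover_mass is hypothetical.

```python
def implied_leftover_mass(log10_prob, n_zerotons):
    """Total mass shared by the zerotons, given the log10 probability of each."""
    return n_zerotons * 10.0 ** log10_prob


if __name__ == "__main__":
    vocab_size = 46_000  # assumption: "46K" taken as exactly 46,000
    cases = [("trans1", 620, -5.337341), ("trans2", 700, -5.383736)]
    for name, n_seen, logp in cases:
        n_zerotons = vocab_size - n_seen  # assumes all seen words are in the dictionary
        mass = implied_leftover_mass(logp, n_zerotons)
        print(f"{name}: {n_zerotons} zerotons, implied left-over mass {mass:.3f}")
```

Under these assumptions both LMs reserve roughly a fifth of the unigram
mass for unseen words. The two values differ because the discounted mass
and the number of zerotons both depend on the training data.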

I hope this answers your question.

--Andreas

Bing Jing wrote:

>Hello there,
>
>Does anyone know how the SRI tool generates
>unigram probabilities for words that do NOT
>occur in the training transcript but are covered
>by the training dictionary? From reading
>NgramLM.cc, I think all those words are
>assigned a probability of LogP_Zero, but it
>seems to me that this value varies across
>different LMs.
>
>I used two quite small sets of transcriptions to
>train LMs, with the same training dictionary
>(46K). The numbers of unique words in trans1 and trans2
>are 620 and 700, respectively. For those words
>that are covered by the lexicon but not in the training
>transcripts, the unigram probabilities are -5.337341 and
>-5.383736, respectively. I still can't figure out how
>these two numbers are generated.
>
>Thanks in advance!
>
>Bing




