Q: probabilities calculation
stolcke at speech.sri.com
Sun Oct 27 10:15:16 PST 2002
Words with zero unigram counts can still get a non-zero probability as a
result of probability smoothing. The discounting method applied to unigrams
will cause the total probability mass of the observed unigrams to be less
than one. This then effectively implements a backing off to a "zero-gram"
(uniform) distribution. Since the DARPA format has no provision for such a
backoff, this is done implicitly:
If there is at least one word with zero counts (sometimes called a
"zeroton"), then the left-over unigram probability mass is distributed
evenly over all zeroton words.

If all words in the vocabulary had non-zero counts (i.e., no zerotons), then
the left-over mass is split evenly among all words and added to the
previously estimated probabilities.
This is all implemented in Ngram::distributeProb(), which in turn is invoked
as part of the backoff weight normalization step.
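The logic amounts to something like the following simplified sketch. This is
my own illustration, not the actual SRILM code; the vocabulary size, observed
word count, and left-over mass in main() are made-up numbers chosen to
roughly match the figures quoted below.

    #include <cstdio>
    #include <cmath>
    #include <vector>

    // Sketch of the redistribution step (not the real
    // Ngram::distributeProb()): "mass" is the probability left over after
    // discounting the observed unigrams; entries of 0.0 in "probs" mark
    // zeroton words.
    void distributeProb(std::vector<double> &probs, double mass)
    {
        size_t numZerotons = 0;
        for (double p : probs)
            if (p == 0.0) numZerotons++;

        if (numZerotons > 0) {
            // Case 1: at least one zeroton -- spread the left-over mass
            // evenly over the zerotons only.
            for (double &p : probs)
                if (p == 0.0) p = mass / numZerotons;
        } else {
            // Case 2: no zerotons -- add an even share of the left-over
            // mass to every word's estimated probability.
            for (double &p : probs)
                p += mass / probs.size();
        }
    }

    int main()
    {
        // Made-up numbers: a 46000-word vocabulary, 620 observed words,
        // and 21% of the probability mass left over after discounting.
        std::vector<double> probs(46000, 0.0);
        for (size_t i = 0; i < 620; i++)
            probs[i] = 0.79 / 620;      // placeholder estimates
        distributeProb(probs, 0.21);

        // Each of the 45380 zerotons gets 0.21/45380, i.e. log10 p ~ -5.33.
        std::printf("zeroton log10 prob = %g\n", std::log10(probs.back()));
        return 0;
    }

Run as-is, this prints a zeroton log10 probability around -5.33, the same
ballpark as the numbers in the question below.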
So the short answer is that, depending on the discounting method chosen,
zerotons get some non-zero probability via backoff to a uniform distribution.
That is also why your two LMs differ: both the left-over mass and the number
of zerotons (46K minus 620 vs. 46K minus 700 words) depend on the training
data, so the per-zeroton probability changes with it. If you want to disable
that you just need to disable unigram discounting (-gt1max 0).
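For example, with ngram-count that would look something like the following
(the file names are placeholders; Good-Turing is the default discounting
method):

    ngram-count -text trans1.txt -vocab dict.txt -order 3 -gt1max 0 -lm trans1.lm

With unigram discounting disabled there is no left-over mass to redistribute,
and the zeroton unigrams end up with zero probability (LogP_Zero).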
I hope this answers your question.
Bing Jing wrote:
>Does anyone know how the SRI tool generates
>unigram probabilities for words that do NOT
>occur in the training transcript but are covered
>by the training dictionary? As I read
>NgramLM.cc, I think all those words are
>assigned a probability of LogP_Zero, but it
>seems to me that this value varies with the
>training data. I used two sets of quite small
>transcriptions to train LMs, with the same
>training dictionary (46K). The numbers of unique
>words in trans1 and trans2 are 620 and 700,
>respectively. And for those words that are covered
>by the lexicon but not in the training transcripts,
>the unigram probabilities are -5.337341 and
>-5.383736, respectively. I still can't figure out
>how these two numbers are generated.
>Thanks in advance!