ngram-discount
Andreas Stolcke
stolcke at speech.sri.com
Wed May 13 11:06:06 PDT 2009
If the model is smoothed (the default), zeroprobs typically occur for
out-of-vocabulary words.
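You can check which tokens are getting zero probability by running ngram
with a higher debug level, e.g. (file names here are just placeholders):

    ngram -lm my.lm -ppl test.txt -debug 2

This prints the probability assigned to each word, and the summary line
reports the number of OOVs and zeroprobs.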
You need to train a model that assigns probability to the unknown word
(<unk>).
Use the ngram-count -unk option (you also need to specify a predefined
vocabulary, so that there are OOV words in your training data from which
to estimate the <unk> probability). Then use ngram -unk to test the LM.
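For example, something like this (vocab.txt, train.txt, test.txt, and
my.lm are placeholder names):

    ngram-count -order 2 -vocab vocab.txt -unk -text train.txt -lm my.lm
    ngram -order 2 -unk -lm my.lm -ppl test.txt

Training words not listed in vocab.txt get mapped to <unk>, so the model
ends up with an explicit <unk> probability, and ngram -unk then applies
that estimate to OOV words in the test data instead of assigning them
zero probability.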
Hope this helps,
Andreas
王秋锋 wrote:
> hi,
> when I used SRILM, I found zeroprobs in the n-gram output. Why do
> zeroprobs turn up?
> I used a bigram model, so when I calculate p(w2|w1) and C(w1 w2) = 0,
> the probability backs off to the unigram: alpha(w1) * p(w2).
> And if C(w2) = 0 (maybe it is out of vocabulary), we can back off to a
> zerogram, i.e., a uniform distribution; or, with Good-Turing
> discounting, some discounted mass can be assigned to this zero-count
> word. So I think zeroprobs should not turn up.
> Do I understand it right?
> Or is the unigram calculated directly by maximum likelihood, i.e.,
> p(w2) = C(w2) / (total count)? If so, why is it not calculated with the
> Good-Turing discount, i.e., p*(w2) = C*(w2) / (total count), where
> C*(w2) is the Good-Turing discounted count?
> Thank you very much.
> Sincerely yours,
> Wang
> 2009-05-09
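(As a side note on the Good-Turing question: with c a count, n_c the
number of word types occurring c times, and N the total number of
tokens, the standard Good-Turing estimate is

    c* = (c + 1) * n_{c+1} / n_c
    p0 = n_1 / N    (total probability mass reserved for unseen events)

That reserved mass p0 can only be shared among words the model knows
about, e.g. zero-count words inside a fixed vocabulary, or <unk>. A word
with no entry in the vocabulary at all gets none of it, which is why
true OOVs still show up as zeroprobs.)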