KN discounting and zeroton words
tanel.alumae at aqris.com
Mon Jun 6 09:38:59 PDT 2005
A small correction: with KN discounting, too, zeroton words get the
same unigram probability as words discounted to zero (using -gt1min).
What I don't understand is why this probability can be higher than for
words that are not discounted to zero. E.g., for a very small test set,
using '-gt1min 2', zeroton and singleton words get a probability of
-0.7323937, but a word occurring twice gets a probability of -1.556303.
I suspect this is some magic property of KN discounting, in which case I
apologize for polluting the list and will go back to reading the
description of the algorithm.
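For what it's worth, the "magic property" is likely that the lower-order KN distributions are estimated from continuation counts (the number of distinct contexts a word follows) rather than raw frequencies. A minimal sketch of the idea, with a hypothetical function name and toy corpus (not SRILM's actual implementation, which also applies absolute discounting and interpolation):

```python
from collections import defaultdict

def kn_unigram_probs(tokens):
    """Kneser-Ney-style unigram estimates from continuation counts:
    P(w) = |{v : count(v, w) > 0}| / (number of distinct bigram types).
    Raw frequency is irrelevant; only the number of distinct preceding
    contexts a word appears after matters."""
    contexts = defaultdict(set)  # word -> set of distinct preceding words
    for prev, cur in zip(tokens, tokens[1:]):
        contexts[cur].add(prev)
    total_bigram_types = sum(len(s) for s in contexts.values())
    return {w: len(s) / total_bigram_types for w, s in contexts.items()}

# Toy corpus: "york" occurs twice but always after "new", while
# "house" occurs twice after two different words.
tokens = "new york red house big house new york".split()
probs = kn_unigram_probs(tokens)
# Despite equal raw counts, "house" has two distinct contexts and
# "york" only one, so P_KN(house) > P_KN(york).
```

So under KN, a word's unigram probability can easily be out of line with its raw count, which may explain why a word occurring twice scores lower than you'd expect.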
On Mon, 2005-06-06 at 19:03 +0300, Tanel Alumäe wrote:
> I've noticed that when using -kndiscount, the zeroton words (words that
> are in the vocabulary but not in the training corpus) get a higher
> unigram LM probability than words that actually occur (rarely) in the
> training corpus. Shouldn't the zeroton words get the same unigram
> probability as the words that are discounted to 0 using the -gt1min
> option?
> With GT, WB and natural discounting, everything works as expected:
> zeroton words get the same unigram probability as the words discounted
> to 0.
> Tanel A.