KN discounting and zeroton words

Sat Jun 11 20:40:28 PDT 2005

In message <1118075939.16700.23.camel at localhost> you wrote:
> 
> A little correction: also with KN discounting, zeroton words get the
> same unigram probability as words discounted to zero (using -gt1min).
> What I don't understand is why this probability can be higher than for
> words that are not discounted to zero.
> 
> E.g., for a very small test set, and using '-gt1min 2', zeroton and
> singleton words get a probability of -0.7323937, but a word occurring
> twice gets a probability of -1.556303.
> 
> I believe this is some magic property of KN discounting, in which case I
> apologize for polluting the list and go back to reading the description
> of the algorithm.

The unigram probabilities for zeroton words are obtained by distributing 
the backoff mass left by the non-zeroton words evenly over all the zerotons
(this corresponds to backing off to a uniform distribution).
Now, if the number of zerotons is small, each of them can end up with
more probability than the low-count observed unigrams.
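
As a toy illustration of the arithmetic (a sketch only, not the actual
implementation), the following spreads the leftover mass evenly over the
zerotons. The function name, the words, and the discounted probabilities
are all made up for the example.

```python
from fractions import Fraction


def spread_leftover_over_zerotons(observed, zerotons):
    """Give each zeroton an equal share of the mass the observed words leave over."""
    leftover = 1 - sum(observed.values())
    share = leftover / len(zerotons)
    probs = dict(observed)
    for word in zerotons:
        probs[word] = share
    return probs


# Hypothetical discounted unigram probabilities (not from any real model).
observed = {
    "the": Fraction(5, 10),
    "cat": Fraction(2, 10),
    "sat": Fraction(1, 10),
}

# The same leftover mass (1/5), split over one zeroton and over four.
few = spread_leftover_over_zerotons(observed, ["dog"])
many = spread_leftover_over_zerotons(observed, ["dog", "cow", "pig", "hen"])

print("one zeroton:  ", few["dog"], "vs. observed 'sat':", few["sat"])
print("four zerotons:", many["dog"], "vs. observed 'sat':", many["sat"])
```

With a single zeroton, "dog" receives the whole leftover mass of 1/5,
which is more than the 1/10 of the observed word "sat". With four
zerotons each gets only 1/20, and the ordering is the expected one.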

The -interpolate1 option should prevent this since it distributes the 
backoff mass over ALL unigrams (adding to the probability of those words
that were observed).
Please check if this is the case, and if not, send me a test case so
I can look into why it doesn't work as intended.
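
For comparison, here is a toy sketch of the interpolated case as
described above: the leftover mass is spread evenly over ALL unigrams
and added to whatever the observed words already have. This is a
simplified reading of the description, not the actual code; the function
name, the words, and the numbers are made up.

```python
from fractions import Fraction


def spread_leftover_over_all(observed, zerotons):
    """Add an equal share of the leftover mass to every word in the vocabulary."""
    leftover = 1 - sum(observed.values())
    vocab = list(observed) + list(zerotons)
    share = leftover / len(vocab)
    # Observed words keep their discounted probability and gain the share;
    # zerotons start from zero and get the share only.
    return {word: observed.get(word, 0) + share for word in vocab}


# Hypothetical discounted unigram probabilities (not from any real model).
observed = {
    "the": Fraction(5, 10),
    "cat": Fraction(2, 10),
    "sat": Fraction(1, 10),
}

probs = spread_leftover_over_all(observed, ["dog"])
print("zeroton 'dog':", probs["dog"])
print("observed 'sat':", probs["sat"])
```

Here every word with a nonzero discounted probability ends up above the
zerotons, because both receive the same uniform share. A word whose
discounted probability is exactly zero would tie with the zerotons
under this sketch.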

--Andreas