OOV calculations

Andreas Stolcke stolcke at speech.sri.com
Sun Nov 18 21:25:54 PST 2007


In message <cea871f80711171443x5861dae8w3cc9d25388e75806 at mail.gmail.com> you wrote:
> Hi,
> 
> I had some interesting observations while trying to build a letter
> based model.  My text file contains a word on each line with letters
> separated by spaces.
> 
> 1. kndiscount gives an error for this data file even though
> ukndiscount seems to work.  Is this a bug?
> 
> ngram-count -kndiscount -order 3 -lm foo3.lm -text turkish.train.oov.split
> one of modified KneserNey discounts is negative
> error in discount estimator for order 1

This is quite possible, since the two methods use different discounting
formulae.  Also, the count-of-count statistics may be quite atypical here,
because the number of distinct unigram types in a letter-based model is
very small.
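
For reference, modified KN estimates its three discounts from the
count-of-counts n1..n4 (the number of types occurring exactly once, twice,
etc.); if I remember the formulae in ngram-discount(7) correctly, they are
roughly

	Y   = n1 / (n1 + 2*n2)
	D1  = 1 - 2*Y*(n2/n1)
	D2  = 2 - 3*Y*(n3/n2)
	D3+ = 3 - 4*Y*(n4/n3)

so an unusual n2/n1 or n3/n2 ratio (easy to get when there are only a few
dozen letter unigrams) can push one of the discounts below zero, which is
the error you saw.  Original KN uses just the single discount
n1 / (n1 + 2*n2), which always lies between 0 and 0.5, so -ukndiscount
cannot fail in this particular way.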

> 
> 2. ukndiscount accepts the -interpolate option and in fact does better
> with it.  According to the documentation only wbdiscount, cdiscount,
> and kndiscount are supposed to work with interpolate.  I checked the
> output with -debug 3 and all probabilities seem to add up to 1.  Is
> the documentation out of date?

ukndiscount was added later, and it seems the man page was never fully updated.
ukndiscount is certainly supposed to support -interpolate.
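
In other words, something like the following should work (reusing your
file names from above):

	ngram-count -ukndiscount -interpolate -order 3 \
		-lm foo3.lm -text turkish.train.oov.split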

> 3. Training with -order k and then testing with -order n does not give
> the same results as training with -order n and testing with -order n.
> Is this normal?  Which discounting methods should give equal results?

This is a known (and desired) feature of KN (original and modified)
discounting.  KN treats the highest-order N-grams differently from the
lower-order ones, and the lower-order N-grams are not supposed to be 
used by themselves.  The reason is that the lower-order estimates are
specifically chosen to work well as backoffs, not as standalone estimates.
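
To make the comparison concrete (with made-up file names), the following
two perplexities will generally differ under KN discounting:

	# train a trigram KN model, then evaluate it truncated to order 2
	ngram-count -kndiscount -order 3 -lm tri.lm -text train.txt
	ngram -order 2 -lm tri.lm -ppl test.txt

	# train a bigram KN model directly and evaluate it at order 2
	ngram-count -kndiscount -order 2 -lm bi.lm -text train.txt
	ngram -order 2 -lm bi.lm -ppl test.txt

In the first case the bigram estimates were built to serve as backoff
distributions for the trigrams, not as a standalone bigram model.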

Andreas 



