OOV calculations
Andreas Stolcke
stolcke at speech.sri.com
Sun Nov 18 21:25:54 PST 2007
In message <cea871f80711171443x5861dae8w3cc9d25388e75806 at mail.gmail.com> you wrote:
> Hi,
>
> I had some interesting observations while trying to build a letter
> based model. My text file contains a word on each line with letters
> separated by spaces.
>
> 1. kndiscount gives an error for this data file even though
> ukndiscount seems to work. Is this a bug?
>
> ngram-count -kndiscount -order 3 -lm foo3.lm -text turkish.train.oov.split
> one of modified KneserNey discounts is negative
> error in discount estimator for order 1
This is quite possible, since the formulae used by the two methods differ.
Also, the count-of-counts statistics may be quite atypical here, given
that the number of distinct unigram types is small for a letter-based
model.
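For reference, here is a rough sketch of how the modified KN discounts are
estimated from the counts-of-counts n1, n2, n3, n4 of the training data
(following Chen & Goodman's formulation; the file names below are just the
placeholders from your example):

    Y   = n1 / (n1 + 2*n2)
    D1  = 1 - 2*Y*(n2/n1)
    D2  = 2 - 3*Y*(n3/n2)
    D3+ = 3 - 4*Y*(n4/n3)

With a small letter alphabet the unigram counts-of-counts are tiny and
erratic (some may even be zero), so one of D1, D2, D3+ can easily come out
negative, which is the error you see. You can inspect the statistics
yourself by dumping the counts, e.g.

    ngram-count -order 3 -text turkish.train.oov.split -write counts.3grams

and tallying how many unigrams occur exactly 1, 2, 3, and 4 times.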
>
> 2. ukndiscount accepts the -interpolate option and in fact does better
> with it. According to the documentation only wbdiscount, cdiscount,
> and kndiscount are supposed to work with interpolate. I checked the
> output with -debug 3 and all probabilities seem to add up to 1. Is
> the documentation out of date?
ukndiscount was added later, and it seems the man page was not fully updated.
ukndiscount is certainly supposed to support -interpolate.
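For example, something along these lines should work (file names are just
placeholders):

    ngram-count -ukndiscount -interpolate -order 3 -text turkish.train.oov.split -lm foo3.ukn.lm
    ngram -lm foo3.ukn.lm -order 3 -ppl heldout.txt -debug 3

The -debug 3 output sums the distribution in each context, so you can check
(as you already did) that it is properly normalized.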
> 3. Training with -order k and then testing with -order n does not give
> the same results as training with -order n and testing with -order n.
> Is this normal? Which discounting methods should give equal results?
This is a known (and desired) feature of KN (original and modified)
discounting. KN treats the highest-order N-grams differently from the
lower-order ones, and the lower-order N-grams are not supposed to be
used by themselves. The reason is that the lower-order estimates are
specifically chosen to work well as backoffs, not as standalone estimates.
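So if you want to compare, say, bigram and trigram KN models on an equal
footing, it is safer to train each order separately rather than truncating a
higher-order model at test time, e.g. (placeholders again):

    ngram-count -kndiscount -interpolate -order 2 -text train.txt -lm kn.2.lm
    ngram-count -kndiscount -interpolate -order 3 -text train.txt -lm kn.3.lm
    ngram -lm kn.2.lm -order 2 -ppl test.txt
    ngram -lm kn.3.lm -order 3 -ppl test.txt

Discounting methods that leave the lower-order counts untouched (e.g.,
Good-Turing or -wbdiscount) should, I believe, give essentially the same
results either way.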
Andreas