OOV calculations

Sat Nov 17 14:43:20 PST 2007

Hi,

I had some interesting observations while trying to build a letter
based model.  My text file contains a word on each line with letters
separated by spaces.
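
For reference, a minimal sketch of how such a file could be produced from a plain word list. The function names and sample words are only illustrative and are not part of SRILM; the sketch assumes each letter is a single Unicode character.

```python
def split_letters(word):
    """Return the word with its letters separated by single spaces."""
    # str.strip() drops the trailing newline; joining a string with " "
    # puts one space between consecutive characters.
    return " ".join(word.strip())


def split_lines(lines):
    """Convert an iterable of words (one per line) to letter-split lines."""
    # Blank lines are skipped so they do not become empty "sentences".
    return [split_letters(line) for line in lines if line.strip()]


if __name__ == "__main__":
    for out in split_lines(["kitap\n", "\n", "ev\n"]):
        print(out)
```

Each output line is then treated by ngram-count as a "sentence" whose tokens are letters.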

1. kndiscount gives an error for this data file even though
ukndiscount seems to work.  Is this a bug?

ngram-count -kndiscount -order 3 -lm foo3.lm -text turkish.train.oov.split
one of modified KneserNey discounts is negative
error in discount estimator for order 1

2. ukndiscount accepts the -interpolate option and in fact does better
with it.  According to the documentation only wbdiscount, cdiscount,
and kndiscount are supposed to work with interpolate.  I checked the
output with -debug 3 and all probabilities seem to add up to 1.  Is
the documentation out of date?

3. Training with -order k and then testing with -order n does not give
the same results as training with -order n and testing with -order n.
Is this normal?  Which discounting methods should give equal results?

deniz

On Nov 1, 2007 5:27 PM, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>
> In message <cea871f80711010049p5563bce5ib575ec42ab432dcd at mail.gmail.com> you wrote:
> > Thank you.
> >
> > > You cannot compare LMs with different OOV counts.  You need to create a
> > > model that assigns a nonzero probability to every event.  E.g., you
> > > could have a letter-probability model for OOVs.
> >
> > As for your suggestion of creating a letter-probability model for OOVs
> > (and maybe interpolating it with the ngram model), are there any
> > tools/documentation in the srilm package that could be helpful?  If
> > not I think we can (1) go into the source code and figure out how to
> > create a new letter-probability LM, or (2) create an independent
> > letter-probability LM outside srilm and manually interpolate its
> > results with the -debug 2 output of ngram.
> >
> > I am assuming here (maybe contrary to your suggestion) that we can
> > create a model that assigns a nonzero probability to every event by
> > interpolating a regular ngram model (with OOVs > 0) and a
> > letter-probability model.
>
> Actually, I wasn't thinking of covering all words with a letter
> probability model (which would be poor for non-OOV words) and
> interpolating.  A more typical approach is to have a word LM with an
> OOV token, and when you are inside the OOV you assign a probability to
> the specific word by a letter LM.  So the total probability of
>
>         p(a b c), where "b" is an OOV, would be
>
>         p(a | ...) p(OOV | a) p(b | OOV) p(c | a OOV),  and
>
> p(b|OOV) is given by a totally separate LM that operates in terms of letters.
>
> Obviously this isn't implemented in SRILM at this point, but you can compute
> total probabilities, perplexities, etc. by first running the word LM, then
> the letter LM just on the OOVs in your test set, and adding the log
> probabilities.
>
> Andreas
>
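
The procedure described above (score the text with the word LM, score only the OOV words with the letter LM, then add the log probabilities) can be sketched as follows. This is a sketch under assumptions, not something SRILM provides: the function names are made up, all log probabilities are assumed to be base 10, the word LM is assumed to score each OOV as a single OOV token, and the letter LM score for each OOV word is assumed to include its end-of-word event. The perplexity is normalized by words plus sentence ends, with OOVs now counted as ordinary words because they receive a nonzero probability.

```python
def combined_logprob(word_lm_logprob, oov_letter_logprobs):
    """Add the letter-LM log probabilities of the OOV words to the
    word-LM log probability of the whole text (all base 10)."""
    # log p(a OOV c) + log p(b | OOV), summed over every OOV occurrence
    return word_lm_logprob + sum(oov_letter_logprobs)


def perplexity(total_logprob, num_words, num_sentences):
    """Perplexity from a base-10 log probability, counting every word
    (including former OOVs) and one end-of-sentence event per sentence."""
    num_events = num_words + num_sentences
    return 10.0 ** (-total_logprob / num_events)


if __name__ == "__main__":
    # Hypothetical numbers: one 4-word sentence containing two OOV words.
    total = combined_logprob(-10.0, [-2.0, -3.0])
    print(total, perplexity(total, num_words=4, num_sentences=1))
```

Because both models are now charged for every event, totals computed this way are comparable across word LMs with different vocabularies, which the raw perplexities with differing OOV counts are not.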