OOV calculations

Thu Nov 1 08:27:38 PDT 2007

In message <cea871f80711010049p5563bce5ib575ec42ab432dcd at mail.gmail.com> you wrote:
> Thank you.
> 
> > You cannot compare LMs with different OOV counts.  You need to create a
> > model that assigns a nonzero probability to every event.  E.g., you
> > could have a letter-probability model for OOVs.
> 
> As for your suggestion of creating a letter-probability model for OOVs
> (and maybe interpolating it with the ngram model), are there any
> tools/documentation in the srilm package that could be helpful?  If
> not I think we can (1) go into the source code and figure out how to
> create a new letter-probability LM, or (2) create an independent
> letter-probability LM outside srilm and manually interpolate its
> results with the -debug 2 output of ngram.
> 
> I am assuming here (maybe contrary to your suggestion) that we can
> create a model that assigns a nonzero probability to every event by
> interpolating a regular ngram model (with OOVs > 0) and a
> letter-probability model.

Actually, I wasn't thinking of covering all words with a letter
probability model (which would be poor for non-OOV words) and
interpolating.  A more typical approach is to have a word LM with an
OOV token, and when you are inside the OOV you assign a probability to
the specific word using a letter LM.  So the total probability of

	p(a b c), where "b" is an OOV, would be

	p(a | ...) p(OOV | a) p(b | OOV) p(c | a OOV)

where p(b | OOV) is given by a completely separate LM that operates on letters.

Obviously this isn't implemented in SRILM at this point, but you can compute
total probabilities, perplexities, etc. by first running the word LM, then
the letter LM just on the OOVs in your test set, and adding the log
probabilities.
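
A minimal sketch of that bookkeeping, in Python rather than anything shipped with SRILM. It assumes you have already extracted, for each test token, the log10 probability the word LM assigned (with OOVs scored as the OOV token) plus a flag saying whether the token was an OOV. The letter model here is only an add-one-smoothed letter unigram with an end-of-word symbol, chosen for brevity; the names END, train_letter_counts, letter_logprob, combined_logprob and perplexity are all hypothetical, and a real letter n-gram LM would likely do better.

```python
import math
from collections import Counter

END = "</w>"  # hypothetical end-of-word symbol for the letter model


def train_letter_counts(words):
    """Count letters (plus one end-of-word symbol per word) in a word list."""
    counts = Counter()
    for word in words:
        counts.update(word)  # iterating a string counts its characters
        counts[END] += 1
    return counts


def letter_logprob(word, counts, alphabet_size):
    """log10 p(word | OOV) under an add-one-smoothed letter unigram model.

    alphabet_size is the number of distinct symbols the model may emit,
    including END (an assumption the caller must supply).
    """
    total = sum(counts.values())
    logprob = 0.0
    for symbol in list(word) + [END]:
        logprob += math.log10((counts[symbol] + 1) / (total + alphabet_size))
    return logprob


def combined_logprob(scored_tokens, counts, alphabet_size):
    """Add the letter-LM log probability to the word-LM score of every OOV.

    scored_tokens: iterable of (token, word_lm_log10_prob, is_oov).
    """
    total = 0.0
    for token, word_lm_logprob, is_oov in scored_tokens:
        total += word_lm_logprob
        if is_oov:
            total += letter_logprob(token, counts, alphabet_size)
    return total


def perplexity(total_log10_prob, n_events):
    """Perplexity from a total log10 probability over n_events scored events."""
    return 10 ** (-total_log10_prob / n_events)
```

Note that n_events should now include the OOV tokens (and the end-of-sentence events, if the word LM scores them), since every one of them receives a nonzero probability.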

Andreas