OOV calculations

Deniz Yuret dyuret at ku.edu.tr
Thu Nov 1 00:49:51 PDT 2007

Thank you.

> You cannot compare LMs with different OOV counts.  You need to create a
> model that assigns a nonzero probability to every event.  E.g., you
> could have a letter-probability model for OOVs.

As for your suggestion of creating a letter-probability model for OOVs
(and perhaps interpolating it with the ngram model): are there any
tools or documentation in the srilm package that could be helpful?  If
not, I think we can either (1) go into the source code and figure out
how to create a new letter-probability LM, or (2) build an independent
letter-probability LM outside srilm and manually interpolate its
results with the -debug 2 output of ngram.

I am assuming here (maybe contrary to your suggestion) that we can
create a model that assigns a nonzero probability to every event by
interpolating a regular ngram model (with OOVs > 0) and a
letter-probability model.


On 11/1/07, Andreas Stolcke <stolcke at speech.sri.com> wrote:
> In message <cea871f80710310302s53235d17x16ee2278d2170451 at mail.gmail.com> you wrote:
> > Hi,
> >
> > We are working on language models for agglutinative languages where
> > the number of unique tokens is comparatively large, and dividing words
> > into morphemes is useful.  When such divisions are performed (e.g.
> > representing each compound word as two tokens: stem+ and +suffix), the
> > number of unique tokens and the number of OOV tokens are both reduced;
> > however, it becomes difficult to compare two such systems with
> > different OOV counts.
> >
> > Thus I started looking carefully into the ngram output, and so far
> > here is what I have understood, please correct me if I am wrong:
> >
> > 1. logprob is the log of the product of the probabilities for all
> > non-oov tokens (including </s>).
> correct.
> > 2. ppl = 10^(-logprob / (ntokens - noov + nsentences))
> correct.
> > 3. ppl1 = 10^(-logprob / (ntokens - noov))
> correct.
> > 4. I am not quite sure what zeroprobs gives.
> Words that are in the vocabulary but get probability 0 in the LM.
> They are treated the same as OOVs for the purpose of perplexity computation.
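Just to confirm my reading of items 1-4, here is how I would reproduce
ppl and ppl1 from the summary counts that ngram reports (treating
zeroprobs the same as OOVs in the denominator, as you describe; the
function is my own sketch, not srilm code):

```python
def perplexities(logprob, ntokens, noov, nzeroprobs, nsentences):
    """Recompute ngram-style ppl and ppl1 from its -ppl summary counts.

    logprob is log10 of the product over all scored tokens (including
    </s>); OOVs and zeroprobs contribute nothing to logprob and are
    excluded from the denominator.
    """
    denom = ntokens - noov - nzeroprobs
    ppl = 10 ** (-logprob / (denom + nsentences))   # includes </s> counts
    ppl1 = 10 ** (-logprob / denom)                 # excludes </s> counts
    return ppl, ppl1
```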
> > My first question is about a slight inconsistency in the calculation
> > of ppl1: the </s> probabilities are included in logprob, however their
> > count is not included in the denominator.  Shouldn't we have a
> > separate logprob total that excludes </s> for the ppl1 calculation?
> No, because the idea is that sentence boundaries are arbitrary
> and only a construct used by the LM to assign probabilities to words.
> So to compare two LMs that use a different sentence segmentation you
> need to normalize by the number of words excluding the </s> (which differ),
> but you need to include the probability assigned to </s> because they
> are part of the total probability the LMs assign to the complete word
> sequence.  e.g.:  P(a b c) = P(a) P(b | a) P(</s> | a b) P(c | a b <s>)
> if the LM happens to require a sentence boundary between b and c.
> Actually, that's an approximation because you really need to sum over
> all possible positions of sentence boundaries.
> To compute the full probability summing over all segmentations
> you need to run a "hidden event" N-gram model, implemented by
> ngram -hidden-vocab (see man page).
> > My second question is what exactly does zeroprobs give?
> See above.  If prob = 0 the perplexity becomes undefined (or infinity),
> so you need to remove them from the computation somehow (like OOVs).
> >
> > My final question is on how to fairly compare two models which divide
> > the same data into different numbers of tokens and have different OOV
> > counts.  It seems like the change in the number of tokens can be dealt
> > with comparing the probabilities assigned to the whole data set
> > (logprob) rather than per token averages (ppl).  However the current
> > output totally ignores the penalty that should be incurred from OOV
> > tokens.  As an easy solution, one can designate a fixed penalty for
> > each OOV token to be added to the logprob total.  It is not clear how
> > that fixed penalty should be determined.  A better solution is to have
> > a character-based model that assigns a non-zero probability to every
> > word and maybe interpolate it with the token-based model.  I am not
> > quite sure how this is possible in the srilm framework.
> You cannot compare LMs with different OOV counts.  You need to create a
> model that assigns a nonzero probability to every event.  E.g., you
> could have a letter-probability model for OOVs.
> As for comparing LMs with different number of tokens, that's easy.
> You are really comparing the total probabilities assigned to the complete
> observation sequence, however the various LMs choose to split up that
> sequence.  So look at the "logprob" output, not ppl.   If you want to
> report ppls just choose one token sequence as your reference and use that
> number of tokens in the denominator of the ppl computation for ALL LMs
> (you have to compute ppl from logprob yourself).
> Andreas
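So if I understand the recipe for reporting ppls: pick one
segmentation's token count as the reference and normalize every model's
logprob by that same count.  A quick sketch (the counts below are
invented examples, not real measurements):

```python
def comparable_ppl(logprob, ref_ntokens):
    """Perplexity normalized by a fixed reference token count, so that
    models segmenting the same text differently can be compared."""
    return 10 ** (-logprob / ref_ntokens)

# Two hypothetical models scoring the same data with different tokenizations:
word_logprob = -12000.0    # word-level model, 10k tokens
morph_logprob = -11500.0   # morpheme-level model, 16k tokens
ref_ntokens = 10000        # reference count (word-level segmentation)
```

The model with the higher total logprob then gets the lower normalized
perplexity, regardless of how many tokens it split the data into.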

More information about the SRILM-User mailing list