OOV calculations

Wed Oct 31 16:46:50 PDT 2007

In message <cea871f80710310302s53235d17x16ee2278d2170451 at mail.gmail.com> you wrote:
> Hi,
> 
> We are working on language models for agglutinative languages where
> the number of unique tokens is comparatively large, and dividing words
> into morphemes is useful.  When such divisions are performed (e.g.
> represent each compound word as two tokens: stem+ and +suffix), number
> of unique tokens and the number of OOV tokens are reduced, however it
> becomes difficult to compare two such systems with different OOV
> counts.
> 
> Thus I started looking carefully into the ngram output, and so far
> here is what I have understood, please correct me if I am wrong:
> 
> 1. logprob is the log of the product of the probabilities for all
> non-oov tokens (including </s>).

correct.

> 2. ppl = 10^(-logprob / (ntokens - noov + nsentences))

correct.

> 3. ppl1 = 10^(-logprob / (ntokens - noov))

correct.

> 4. I am not quite sure what zeroprobs gives.

Words that are in the vocabulary but get probability 0 in the LM.
They are treated the same as OOVs for the purpose of perplexity computation.
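
As a rough sketch, the two formulas above can be recomputed from the
counts printed by ngram -ppl as follows.  The function and argument names
are made up for illustration (they are not part of SRILM), and zeroprobs
are subtracted from the denominator in the same way as OOVs, as described
above.

```python
def srilm_perplexities(logprob, ntokens, nsentences, noov, nzeroprobs=0):
    """Recompute ppl and ppl1 from a log10 total and the reported counts.

    All names here are hypothetical; they only mirror the quantities
    discussed in this thread.
    """
    # Tokens that actually contributed a probability to logprob:
    # OOVs and zero-probability words are both left out.
    scored = ntokens - noov - nzeroprobs
    # ppl normalizes by words plus one </s> per sentence.
    ppl = 10 ** (-logprob / (scored + nsentences))
    # ppl1 normalizes by words only, although logprob still contains
    # the </s> probabilities.
    ppl1 = 10 ** (-logprob / scored)
    return ppl, ppl1


# Made-up numbers, purely to show the call.
ppl, ppl1 = srilm_perplexities(
    logprob=-20.0, ntokens=12, nsentences=2, noov=1, nzeroprobs=1
)
print(ppl, ppl1)
```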

> My first question is about a slight inconsistency in the calculation
> of ppl1: the </s> probabilities are included in logprob, however their
> count is not included in the denominator.  Shouldn't we have a
> separate logprob total that excludes </s> for the ppl1 calculation?

No, because the idea is that sentence boundaries are arbitrary
and only a construct used by the LM to assign probabilities to words.
So to compare two LMs that use different sentence segmentations you
need to normalize by the number of words, excluding the </s> tokens
(whose number differs between segmentations), but you need to include
the probabilities assigned to </s> because they are part of the total
probability the LMs assign to the complete word sequence.
E.g.:  P(a b c) = P(a) P(b | a) P(</s> | a b) P(c | a b <s>)
if the LM happens to require a sentence boundary between b and c.
Actually, that's an approximation because you really need to sum over 
all possible positions of sentence boundaries.

To compute the full probability summing over all segmentations
you need to run a "hidden event" N-gram model, implemented by
ngram -hidden-vocab (see man page).

> My second question is what exactly does zeroprobs give?

See above.  If prob = 0 the perplexity becomes undefined (or infinity),
so you need to remove those words from the computation somehow (like OOVs).

> 
> My final question is on how to fairly compare two models which divide
> the same data into different numbers of tokens and have different OOV
> counts.  It seems like the change in the number of tokens can be dealt
> with comparing the probabilities assigned to the whole data set
> (logprob) rather than per token averages (ppl).  However the current
> output totally ignores the penalty that should be incurred from OOV
> tokens.  As an easy solution, one can designate a fixed penalty for
> each OOV token to be added to the logprob total.  It is not clear how
> that fixed penalty should be determined.  A better solution is to have
> a character-based model that assigns a non-zero probability to every
> word and maybe interpolate it with the token-based model.  I am not
> quite sure how this is possible in the srilm framework.

You cannot compare LMs with different OOV counts.  You need to create a
model that assigns a nonzero probability to every event.  E.g., you
could have a letter-probability model for OOVs.
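
To make the idea concrete, here is a minimal sketch of the simplest
possible letter model: every letter, plus an end-of-word symbol, is
equally likely.  This is only an illustration, not something SRILM
provides.  The function names, the uniform distribution, and the
log10_p_oov argument (the log probability of emitting an OOV word at
all) are assumptions you would replace with estimates from your own
data.

```python
import math


def letter_model_logprob(word, alphabet_size):
    """log10 probability of `word` under a uniform letter model.

    Each position is one of alphabet_size letters or an end-of-word
    symbol, all equally likely, so every finite string gets a nonzero
    probability.
    """
    per_symbol = -math.log10(alphabet_size + 1)
    # len(word) letters plus the end-of-word symbol.
    return (len(word) + 1) * per_symbol


def logprob_with_oovs(lm_logprob, oov_words, alphabet_size, log10_p_oov):
    """Add the cost of every OOV word to the LM's log10 total.

    log10_p_oov is a hypothetical stand-in for the probability of
    choosing the OOV class in the first place.
    """
    penalty = sum(
        log10_p_oov + letter_model_logprob(word, alphabet_size)
        for word in oov_words
    )
    return lm_logprob + penalty


# Made-up numbers, purely to show the call.
print(logprob_with_oovs(-10.0, ["ab"], alphabet_size=9, log10_p_oov=-2.0))
```

With totals computed this way, no token is skipped, so two systems that
originally had different OOV counts are scored on the same events.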

As for comparing LMs with different numbers of tokens, that's easy.
You are really comparing the total probabilities assigned to the complete
observation sequence, however the various LMs choose to split up that
sequence.  So look at the "logprob" output, not ppl.   If you want to
report ppls just choose one token sequence as your reference and use that
number of tokens in the denominator of the ppl computation for ALL LMs
(you have to compute ppl from logprob yourself).
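
A small sketch of that normalization, assuming you have already read the
logprob totals off the ngram output.  The function name and the numbers
are invented for illustration, and the comparison is only meaningful if
both totals cover every token (no OOVs left out).

```python
def normalized_ppl(logprob, reference_tokens):
    """Perplexity from a log10 total, using a shared token count.

    reference_tokens is the size of whichever tokenization you pick as
    the reference, e.g. the number of whole words in the test set.
    """
    return 10 ** (-logprob / reference_tokens)


# Hypothetical totals for two tokenizations of the same test data.
reference_tokens = 1000          # whole words in the test set
word_lm_logprob = -2500.0        # word-based LM
morph_lm_logprob = -2400.0       # stem+/+suffix LM, more tokens

print(normalized_ppl(word_lm_logprob, reference_tokens))
print(normalized_ppl(morph_lm_logprob, reference_tokens))
```

Because the denominator is the same for both models, the ranking by
normalized ppl always agrees with the ranking by logprob.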

Andreas