OOV calculations

Deniz Yuret dyuret at ku.edu.tr
Wed Oct 31 03:02:33 PDT 2007


Hi,

We are working on language models for agglutinative languages, where
the number of unique tokens is comparatively large and dividing words
into morphemes is useful.  When such divisions are performed (e.g.
representing each compound word as two tokens: stem+ and +suffix), the
number of unique tokens and the number of OOV tokens are both reduced;
however, it becomes difficult to compare two such systems with
different OOV counts.
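For illustration, here is a toy sketch of what I mean (the words and
the segmentation rule are made up; a real morphological analyzer is of
course more involved):

train = ["evlerde", "okullarda", "atlara", "gozlere"]
test  = ["evlere", "okullara"]

def split(word):
    # made-up segmentation: peel off one of a few known suffixes
    for suf in ("lerde", "larda", "lere", "lara"):
        if word.endswith(suf):
            return [word[:-len(suf)] + "+", "+" + suf]
    return [word]

whole_vocab = set(train)
split_vocab = {t for w in train for t in split(w)}

oov_whole = [w for w in test if w not in whole_vocab]
oov_split = [t for w in test for t in split(w) if t not in split_vocab]

print(len(oov_whole))  # 2: both test words are unseen as whole words
print(len(oov_split))  # 0: all stems and suffixes were seen in training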

Thus I started looking carefully into the ngram output, and here is
what I have understood so far; please correct me if I am wrong:

1. logprob is the log of the product of the probabilities for all
non-OOV tokens (including </s>).
2. ppl = 10^(-logprob / (ntokens - noov + nsentences))
3. ppl1 = 10^(-logprob / (ntokens - noov))
4. I am not quite sure what zeroprobs gives.
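In code, my understanding of items 2 and 3 amounts to the following
(a small sketch with made-up statistics of the kind ngram -ppl
reports; it just restates the formulas above):

# made-up test-set statistics
logprob    = -12345.6   # sum of log10 p over all non-OOV tokens, incl. </s>
ntokens    = 10000      # word tokens in the test data (not counting </s>)
noov       = 250        # OOV tokens
nsentences = 500        # sentences, i.e. number of </s> events

ppl  = 10 ** (-logprob / (ntokens - noov + nsentences))
ppl1 = 10 ** (-logprob / (ntokens - noov))
print(ppl, ppl1)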

My first question is about a slight inconsistency in the calculation
of ppl1: the </s> probabilities are included in logprob, but their
count is not included in the denominator.  Shouldn't there be a
separate logprob total that excludes </s> for the ppl1 calculation?
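To make the inconsistency concrete (a toy sketch with hypothetical
per-token log10 probabilities; logprob_no_eos is the separate total I
have in mind):

# one two-word sentence with hypothetical log10 probabilities
logp_w1, logp_w2, logp_eos = -1.0, -2.0, -0.5

logprob        = logp_w1 + logp_w2 + logp_eos   # what ngram reports now
logprob_no_eos = logp_w1 + logp_w2              # separate total without </s>

ntokens, noov, nsentences = 2, 0, 1

ppl           = 10 ** (-logprob / (ntokens - noov + nsentences))
ppl1_current  = 10 ** (-logprob / (ntokens - noov))          # </s> mass in the numerator only
ppl1_proposed = 10 ** (-logprob_no_eos / (ntokens - noov))   # numerator and denominator agree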

My second question is: what exactly does zeroprobs give?

My final question is how to fairly compare two models that divide the
same data into different numbers of tokens and have different OOV
counts.  It seems like the change in the number of tokens can be dealt
with by comparing the probabilities assigned to the whole data set
(logprob) rather than per-token averages (ppl).  However, the current
output completely ignores the penalty that should be incurred by OOV
tokens.  As an easy solution, one could designate a fixed penalty for
each OOV token and add it to the logprob total, though it is not clear
how that fixed penalty should be determined.  A better solution would
be a character-based model that assigns a non-zero probability to
every word, perhaps interpolated with the token-based model.  I am not
quite sure how this is possible in the SRILM framework.
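For the fixed-penalty idea, the comparison I have in mind would look
something like this (a sketch; the penalty value and the model
statistics are placeholders, and how to choose the penalty is exactly
the open issue):

oov_penalty = -6.0   # hypothetical log10 probability charged per OOV token

def adjusted_logprob(logprob, noov, penalty=oov_penalty):
    # total log10 probability of the data set, with OOVs penalized
    return logprob + noov * penalty

word_model     = {"logprob": -13000.0, "noov": 400}   # made-up numbers
morpheme_model = {"logprob": -12500.0, "noov": 150}

for name, m in [("word", word_model), ("morpheme", morpheme_model)]:
    print(name, adjusted_logprob(m["logprob"], m["noov"]))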

Any advice would be appreciated.

best,
deniz


