Perplexity

Mon Feb 12 09:41:23 PST 2007

Martha Yifiru wrote:
> Hi,
>
> I want to compare morph-based language model with
> word-based one. To do this I have to do some
> manipulation on the calculation of perplexity for
> morph-based language model so as to have fair
> comparison. I was thinking that the source code for
> perplexity calculation is in ngram.cc but it does not
> seem that the actual perplexity calculation is in
> ngram.cc.
>
> Can anyone help me?
>
>   
The source code for perplexity computation is in lm/src/TextStats.cc .
However, there is no need to modify the code.
Perplexity is normalized by the number of tokens, so when the token
counts differ (words versus morphs) the perplexities are no longer
comparable, but the total log probabilities are.
You can get the log probability from the perplexity output, e.g.:

file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 681 OOVs
0 zeroprobs, logprob= -86334.6 ppl= 103.502 ppl1= 198.958
                      ^^^^^^^^
Assume the "words" in this example are actually morphs, and the actual
number of words (including sentence boundaries) is smaller, say, 25000.
Then the word-perplexity is

    10^ -(-86334.6 / 25000 ) = 2840.43
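
The same renormalization can be scripted. The following is only a
minimal sketch: the helper names parse_logprob and perplexity are
made up for illustration and are not part of SRILM, and the word count
of 25000 is the assumed figure from the example above. It takes the
base-10 logprob from the perplexity output line and divides by
whatever token count you want to normalize by.

```python
import re


def parse_logprob(ppl_output):
    """Extract the base-10 logprob value from a perplexity summary line."""
    match = re.search(r"logprob=\s*(-?\d+(?:\.\d+)?)", ppl_output)
    if match is None:
        raise ValueError("no logprob field found in output")
    return float(match.group(1))


def perplexity(logprob10, num_tokens):
    """Perplexity for a total base-10 log probability over num_tokens tokens."""
    return 10.0 ** (-logprob10 / num_tokens)


# Summary line copied from the example output above.
output = "0 zeroprobs, logprob= -86334.6 ppl= 103.502 ppl1= 198.958"
logprob = parse_logprob(output)

# Assumed number of real words, including sentence boundaries.
num_words = 25000
word_ppl = perplexity(logprob, num_words)

# Sanity check against the reported ppl: with the counts shown above,
# 38238 tokens - 681 OOVs - 0 zeroprobs + 5290 sentence ends = 42847.
morph_ppl = perplexity(logprob, 38238 - 681 - 0 + 5290)

print("word-level perplexity: %.2f" % word_ppl)
print("morph-level perplexity: %.3f" % morph_ppl)
```

Dividing the same logprob by the word count for both models puts the
morph-based and word-based numbers on a common footing.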

--Andreas