stolcke at speech.sri.com
Mon Feb 12 09:41:23 PST 2007
Martha Yifiru wrote:
> I want to compare morph-based language model with
> word-based one. To do this I have to do some
> manipulation on the calculation of perplexity for
> morph-based language model so as to have fair
> comparison. I was thinking that the source code for
> perplexity calculation is in ngram.cc but it does not
> seem that the actual perplexity calculation is in that file.
> Can anyone help me?
The source code for perplexity computation is in lm/src/TextStats.cc .
However, there is no need to modify the code.
When you have different token counts (words versus morphs) the
perplexities are no longer comparable, but the log probabilities are.
You can get the log probability from the perplexity output, e.g.:
file ../ngram-count-gt/eval97.text: 5290 sentences, 38238 words, 681 OOVs
0 zeroprobs, logprob= -86334.6 ppl= 103.502 ppl1= 198.958
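As a sanity check, the reported figures can be reproduced from the counts in that output. The sketch below assumes the usual SRILM normalization: ppl divides the log probability by words - OOVs - zeroprobs + sentences, and ppl1 leaves out the sentence count. The variable names are illustrative only and are not taken from the SRILM source.

```python
# Counts and log probability copied from the perplexity output above.
sentences, words, oovs, zeroprobs = 5290, 38238, 681, 0
logprob = -86334.6  # base-10 log probability of the whole file

# Assumed normalization: OOVs and zeroprobs are excluded from the token
# count; ppl additionally counts one end-of-sentence token per sentence.
scored_tokens = words - oovs - zeroprobs
ppl = 10 ** (-logprob / (scored_tokens + sentences))
ppl1 = 10 ** (-logprob / scored_tokens)

print(f"ppl= {ppl:.3f} ppl1= {ppl1:.3f}")
```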
Assume the "words" in this example are actually morphs, and the actual
number of words (including sentence boundaries) is less, say, 25000. Then the
perplexity normalized by the number of words is
10^(-(-86334.6 / 25000)) = 2840.43
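The same renormalization can be written as a small helper, so that the morph-based and word-based models are both normalized by the word count. This is only a sketch: the function name is made up for illustration, and 25000 is the assumed word count from the example, not a measured value.

```python
def renormalized_ppl(logprob: float, n_tokens: int) -> float:
    """Perplexity for a base-10 log probability, normalized by n_tokens."""
    if n_tokens <= 0:
        raise ValueError("n_tokens must be positive")
    return 10 ** (-logprob / n_tokens)

# logprob from the morph-based run; 25000 is the assumed number of words
# (including sentence boundaries) in the same text.
word_level_ppl = renormalized_ppl(-86334.6, 25000)
print(round(word_level_ppl, 2))
```

Using the same word count as the denominator for both models gives figures that can be compared directly, provided both log probabilities were computed over the same text.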