perplexity evaluation

Valsan, Zica valsan at sony.de
Tue Dec 3 06:13:11 PST 2002


Hi all, 

I'm a new user of the toolkit and I need a little support in
understanding how the perplexity is computed and why it differs from
the value I expected.

For instance, I have the training data in the file train.text, which
contains only one line:
<s> a b c </s>
and a vocabulary (train.vocab) that contains all of these words. I want
to generate an LM based on unigrams only and evaluate it on the same
training data. I don't want any discounting strategy to be applied.
Here are the commands I used:

ngram-count -order 1 -vocab train.vocab -text train.text -lm lm.arpa -gt1max 0
ngram -lm lm.arpa -debug 2 -vocab train.vocab -ppl train.text > out.ppl
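
To double-check what ended up in the model, I dump the unigram section
straight from the ARPA file (a plain sed one-liner of my own, not an
SRILM command):

sed -n '/\\1-grams:/,/\\end\\/p' lm.arpa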


So, according to the theory, the expected value for perplexity is
PP = 3 if the context cues (<s> and </s>) are not taken into account.
This is also what one gets using the CMU toolkit.
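
To spell out the arithmetic I have in mind (counting only the three
real word tokens as predicted events, each with probability 1/3 under
the uniform unigram model):

\[
  PP = \left(\frac{1}{3}\cdot\frac{1}{3}\cdot\frac{1}{3}\right)^{-1/3} = 3
\]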
Using this toolkit and the commands above, what I actually get is
PP = 4. Looking inside the created ARPA model, I can see that </s> has
the same probability as each of the real words (a, b, c).
Could anybody explain to me why it is like this? Did I make a mistake,
or am I missing something?
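
My guess about where PP = 4 comes from (just an assumption on my part,
not something I have verified in the SRILM sources): if </s> is counted
as a fourth predicted event, with probability 1/4 like each real word,
then

\[
  PP = \left(\frac{1}{4}\cdot\frac{1}{4}\cdot\frac{1}{4}\cdot\frac{1}{4}\right)^{-1/4} = 4
\]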

Thank you in advance for your support, 
Zica



