Entropy getting smaller as the corpus gets smaller.
SAI TANG HUANG
sai_tang_huang at hotmail.com
Mon Mar 10 12:36:21 PDT 2008
Hi everyone,
I have computed the entropy for my model with the following command:
ngram -lm small_1.lm -counts small_1.cnt -counts-entropy
where small_1.lm is a trigram model with Witten-Bell discounting (-wbdiscount) created with ngram-count, and small_1.cnt is a count file containing only the events I want to predict.
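In case it helps, the model was built with ngram-count roughly like this (small_1.txt here is just a placeholder name for the subset of TRAIN.txt I used):
ngram-count -order 3 -wbdiscount -text small_1.txt -lm small_1.lm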
The output is this:
file small_1.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -4.69264 ppl= 49276.1 ppl1= 49276.1
This model was actually trained on a subset of my TRAIN.txt corpus. It also gives the following perplexity against an unseen test set:
file ../TEST.txt: 840 sentences, 15700 words, 1289 OOVs
0 zeroprobs, logprob= -38595.4 ppl= 339.378 ppl1= 476.644
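To make sure I am reading these numbers correctly: as far as I understand, ppl is just the logprob normalized per event,
ppl = 10^( -logprob / (sentences + words - OOVs) )
which does seem to match the output above, e.g. 10^(4.69264 / 1) is about 49276 for the count file, and 10^(38595.4 / (840 + 15700 - 1289)) = 10^(38595.4 / 15251) is about 339.4 for the test set. So each number looks consistent with its own logprob.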
On the other hand, I have another model, also trained on a subset of my TRAIN.txt (but a different subset from the one used for small_1.lm), whose entropy output is:
file small_2.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -4.03253 ppl= 10777.8 ppl1= 10777.8
and a perplexity against the same unseen test set of:
file ../TEST.txt: 840 sentences, 15700 words, 1289 OOVs
0 zeroprobs, logprob= -38792.5 ppl= 349.627 ppl1= 491.891
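(If my reading above is right, the same relation holds here too: 10^4.03253 is about 10778 and 10^(38792.5 / 15251) is about 349.6, so again each output is internally consistent.)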
So my question is: why is the entropy larger for the model whose perplexity is actually the smaller of the two? I thought both measures could be used to assess the performance or quality of a language model, so how can the two numbers be so inconsistent?

By the way, my TRAIN.lm (the model created from the whole training corpus) has an entropy output of:
file TRAIN_EVENTS.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -11.5557 ppl= 3.59464e+11 ppl1= 3.59464e+11
which is humongous!
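Again the number itself follows from the logprob (10^(11.5557 / 1) is about 3.59e+11), and my understanding is that entropy and perplexity are just two views of the same quantity, ppl = 2^H, i.e. H = log2(ppl), which is exactly why I would have expected the two measures to rank the models the same way.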
I am a complete beginner in this field, and this really isn't making any sense to me.
Any help will be greatly appreciated.
Regards to all,
Sai