Entropy getting smaller as the corpus gets smaller.
SAI TANG HUANG
sai_tang_huang at hotmail.com
Mon Mar 10 12:36:21 PDT 2008
Hi everyone,
I have computed the entropy for my model with the following command:
ngram -lm small_1.lm -counts small_1.cnt -counts-entropy
where small_1.lm is a trigram model with Witten-Bell discounting (-wbdiscount) created with ngram-count, and small_1.cnt is a count file containing only the events I want to predict.
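In case it helps, the model was built with ngram-count roughly like this (small_1.txt here is just a placeholder name for the subset of TRAIN.txt I used):
ngram-count -order 3 -wbdiscount -text small_1.txt -lm small_1.lm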
The output is this:
file small_1.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -4.69264 ppl= 49276.1 ppl1= 49276.1
This model was actually trained on a subset of my TRAIN.txt corpus. It also gives the following perplexity against an unseen test set:
file ../TEST.txt: 840 sentences, 15700 words, 1289 OOVs
0 zeroprobs, logprob= -38595.4 ppl= 339.378 ppl1= 476.644
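To make sure I am reading these numbers correctly: as far as I understand, ppl is just the logprob normalized per event,
ppl = 10^( -logprob / (sentences + words - OOVs) )
which does seem to match the output above, e.g. 10^(4.69264 / 1) is about 49276 for the count file, and 10^(38595.4 / (840 + 15700 - 1289)) = 10^(38595.4 / 15251) is about 339.4 for the test set. So each number looks consistent with its own logprob.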
On the other hand, I have another model, also trained on a subset of my TRAIN.txt (but a different subset from the one used for small_1.lm), whose entropy output is:
file small_2.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -4.03253 ppl= 10777.8 ppl1= 10777.8
and a perplexity against the same unseen test set of:
file ../TEST.txt: 840 sentences, 15700 words, 1289 OOVs
0 zeroprobs, logprob= -38792.5 ppl= 349.627 ppl1= 491.891
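(If my reading above is right, the same relation holds here too: 10^4.03253 is about 10778 and 10^(38792.5 / 15251) is about 349.6, so again each output is internally consistent.)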
So my question is: why is the entropy larger for the model whose perplexity is actually the smaller of the two? I thought both measures could be used to assess the performance or quality of a language model, so how can the two numbers be so inconsistent?

By the way, my TRAIN.lm (the model created from the whole training corpus) has an entropy output of:
file TRAIN_EVENTS.cnt: 0 sentences, 1 words, 0 OOVs
0 zeroprobs, logprob= -11.5557 ppl= 3.59464e+11 ppl1= 3.59464e+11
which is humongous!
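Again the number itself follows from the logprob (10^(11.5557 / 1) is about 3.59e+11), and my understanding is that entropy and perplexity are just two views of the same quantity, ppl = 2^H, i.e. H = log2(ppl), which is exactly why I would have expected the two measures to rank the models the same way.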
I am a complete beginner in this field, and this really isn't making any sense to me.
Any help will be greatly appreciated.
Regards to all,
Sai