[SRILM User List] perplexity results

Andreas Stolcke stolcke at icsi.berkeley.edu
Tue Jan 24 10:06:00 PST 2017


Make sure text normalization is consistent between training and test
data (e.g., capitalization, where you should consider mapping everything
to lower case, and the encoding of diacritics).
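
A minimal sketch of such a normalization pass, in Python, might look like
the following. The function name normalize_line is made up for
illustration, and NFC plus lower-casing is only one reasonable choice; the
important part is that the same function is applied to both the training
text and the test text before they are handed to ngram-count and ngram.

```python
import unicodedata


def normalize_line(line: str) -> str:
    """Lower-case a line and encode diacritics in one canonical form.

    Hypothetical helper, not part of SRILM. NFC composes sequences such as
    'n' + COMBINING TILDE into the single code point for n-with-tilde, so
    both encodings end up as the same token.
    """
    lowered = line.lower()
    composed = unicodedata.normalize("NFC", lowered)
    # Collapse runs of whitespace so tokenization is identical everywhere.
    return " ".join(composed.split())


if __name__ == "__main__":
    decomposed = "Pen\u0303a  NIETO"   # 'n' followed by a combining tilde
    precomposed = "Pe\u00f1a Nieto"    # single precomposed code point
    # Both spellings map to the same normalized string.
    print(normalize_line(decomposed) == normalize_line(precomposed))
```

Whether lower-casing is appropriate depends on the application; if case
carries information you need, keep it, but then keep it in both data sets.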

Also, you're using -unk, i.e., your model contains an unknown-word
token, which means OOVs get assigned a non-zero but possibly very low
probability. This could mask a big divergence in vocabulary: the high
perplexity could be the result of lots of OOV words that all get a low
probability via <unk>. Try training without -unk and check the tally
of OOVs in the ppl output.
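
To make the effect concrete, here is a rough Python sketch, not SRILM code.
The helper names oov_tally and srilm_ppl are invented for illustration, and
tokenization is assumed to be plain whitespace splitting. The perplexity
arithmetic follows my understanding of how the -ppl summary is computed
(base-10 logprob, with OOVs and zeroprobs excluded from the word count); it
reproduces the figures quoted below for the first English sentence.

```python
from collections import Counter


def oov_tally(train_tokens, test_tokens):
    """Count test tokens that never occur in the training tokens."""
    vocab = set(train_tokens)
    misses = Counter(tok for tok in test_tokens if tok not in vocab)
    return sum(misses.values()), misses


def srilm_ppl(logprob, words, sentences, oovs=0, zeroprobs=0):
    """Recompute ppl and ppl1 from the numbers in a -ppl summary.

    Assumes logprob is base 10. ppl counts one end-of-sentence token per
    sentence in the denominator; ppl1 does not.
    """
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored + sentences))
    ppl1 = 10 ** (-logprob / scored)
    return ppl, ppl1


if __name__ == "__main__":
    # Toy data, purely illustrative.
    train = "the nato mission ended".split()
    test = "The NATO mission officially ended".split()
    total, misses = oov_tally(train, test)
    print(total, dict(misses))

    # Summary line from the thread: 14 words, 1 sentence, logprob= -62.106.
    # The thread reports ppl= 13816.6 ppl1= 27298.9 for this sentence.
    print(srilm_ppl(-62.106, words=14, sentences=1))
```

With -unk, every one of those misses is scored through <unk> and pulls the
logprob down while the summary still says "0 OOVs"; without -unk they are
counted as OOVs and left out of the perplexity, which makes a vocabulary
mismatch visible.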

Andreas

On 1/24/2017 4:57 AM, Dávid Nemeskey wrote:
> Hi,
>
> It is hard to tell without knowing more, e.g. about the training set. But
> I would try running ngram with higher values for -debug. I think even
> -debug 2 tells you the logprob of the individual words. That could be a
> start. I actually added another debug level (100), where I print the 5
> most likely candidates (this requires a "forward trie" in addition to
> the default "backwards" one to be of usable speed) to get a sense of the
> proportions and of how the model and the text differ.
>
> Also, just wondering. Is the training corpus bilingual (en-es)?
>
> Best,
> Dávid Nemeskey
>
> On Tue, Jan 24, 2017 at 1:14 PM, Stefy D. <tsuki_stefy at yahoo.com> wrote:
>> Hello. I have a question regarding perplexity. I am using SRILM to compute
>> the perplexity of some sentences using an LM trained on a big corpus. Given a
>> sentence and an LM, the perplexity tells how well that sentence fits the
>> language (as far as I understood). And the lower the perplexity, the better
>> the sentence fits.
>>
>> $NGRAMCOUNT_FILE -order 5 -interpolate -kndiscount -unk -text
>> Wikipedia.en-es.es -lm lm/lmodel_es.lm
>>
>> $NGRAM_FILE -order 5 -debug 1 -unk -lm lm/lmodel_es.lm -ppl
>> testlabeled.en-es.es  > perplexity_es_testlabeled.ppl
>>
>> I did the same on EN and on ES and here are some results I got:
>>
>> Sixty-six parent coordinators were laid off," the draft complaint says, "and
>> not merely excessed.
>> 1 sentences, 14 words, 0 OOVs
>> 0 zeroprobs, logprob= -62.106 ppl= 13816.6 ppl1= 27298.9
>>
>> Mexico's Enrique Pena Nieto faces tough start
>> 1 sentences, 7 words, 0 OOVs
>> 0 zeroprobs, logprob= -39.1759 ppl= 78883.7 ppl1= 394964
>>
>> The NATO mission officially ended Oct. 31.
>> 1 sentences, 7 words, 0 OOVs
>> 0 zeroprobs, logprob= -29.2706 ppl= 4558.57 ppl1= 15188.6
>>
>> Sesenta y seis padres coordinadores fueron despedidos," el proyecto de
>> denuncia, dice, "y no simplemente excessed.
>> 1 sentences, 16 words, 0 OOVs
>> 0 zeroprobs, logprob= -57.0322 ppl= 2263.79 ppl1= 3668.72
>>
>> México Enrique Peña Nieto enfrenta duras comienzo
>> 1 sentences, 7 words, 0 OOVs
>> 0 zeroprobs, logprob= -29.5672 ppl= 4964.71 ppl1= 16744.7
>>
>>
>> Why are the perplexities for the EN sentences so big? The smallest ppl I get
>> for an EN sentence is about 250. The Spanish sentences have some errors, so
>> I was expecting big ppl numbers. Should I change something in the way I
>> compute the LMs?
>>
>> Thank you very much!!
>>
>>
>>
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user



