[SRILM User List] perplexity results

Stef M mstefd22 at gmail.com
Thu Feb 9 04:21:29 PST 2017


Hello David and Andreas,

Sorry for replying so late. Thank you very much for your suggestions.
Indeed, I had forgotten to preprocess the test set. I got better results
after preprocessing, so thanks a lot for pointing it out!

2017-01-24 19:06 GMT+01:00 Andreas Stolcke <stolcke at icsi.berkeley.edu>:

> Make sure text normalization is consistent between training and test data
> (e.g., capitalization - consider mapping to lower-case, and encoding of
> diacritics).
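(For the archive: a minimal sketch of that kind of normalization using standard Unix tools; train.txt/test.txt are placeholder file names, and the Latin-1 source encoding is just an assumption. tr is byte-oriented, so in a UTF-8 locale gawk's tolower() is a safer way to lower-case accented characters, and iconv can be used to get both files into the same encoding first.)

    # make sure training and test data use the same encoding (assumed: Latin-1 -> UTF-8)
    iconv -f latin1 -t utf-8 train.txt > train.utf8.txt
    iconv -f latin1 -t utf-8 test.txt  > test.utf8.txt

    # lower-case training and test data the same way (multibyte-aware in a UTF-8 locale)
    gawk '{ print tolower($0) }' train.utf8.txt > train.lc.txt
    gawk '{ print tolower($0) }' test.utf8.txt  > test.lc.txt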
>
> Also, you're using -unk, i.e., your model contains an unknown-word token,
> which means OOVs get assigned a non-zero but possibly very low
> probability. This could mask a big divergence in vocabulary, and the high
> perplexity could be the result of lots of OOV words that all get a low
> probability via <unk>. Try training without -unk and observe the tally of
> OOVs in the ppl output.
>
> Andreas
>
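(Sketch of the check described above, reusing the commands from my first mail with -unk dropped; lmodel_es_nounk.lm is just a placeholder name for the closed-vocabulary model:)

    $NGRAMCOUNT_FILE -order 5 -interpolate -kndiscount -text Wikipedia.en-es.es -lm lm/lmodel_es_nounk.lm
    $NGRAM_FILE -order 5 -debug 1 -lm lm/lmodel_es_nounk.lm -ppl testlabeled.en-es.es

The ppl output then reports the number of OOVs per sentence and overall; a large count points to a vocabulary mismatch rather than a genuinely bad fit.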
> On 1/24/2017 4:57 AM, Dávid Nemeskey wrote:
>
>> Hi,
>>
>> It is hard to tell without knowing e.g. the training set, but I would
>> try running ngram with higher values for -debug. I think even -debug 2
>> tells you the logprob of the individual words. That could be a start.
>> I actually added another debug level (100), where I print the 5 most
>> likely candidates (this requires a "forward trie" in addition to the
>> default "backward" one to run at usable speed) to get a sense of the
>> proportions and of how the model and the text differ.
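(For reference, the per-word output mentioned above comes from raising -debug on the same scoring command, e.g.:)

    $NGRAM_FILE -order 5 -debug 2 -unk -lm lm/lmodel_es.lm -ppl testlabeled.en-es.es

With -debug 2, ngram prints, for every word, the conditional probability assigned to it and the n-gram order that was actually used, which makes it easy to spot the positions that drive a high perplexity.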
>>
>> Also, just wondering. Is the training corpus bilingual (en-es)?
>>
>> Best,
>> Dávid Nemeskey
>>
>> On Tue, Jan 24, 2017 at 1:14 PM, Stefy D. <tsuki_stefy at yahoo.com> wrote:
>>
>>> Hello. I have a question regarding perplexity. I am using SRILM to compute
>>> the perplexity of some sentences using a LM trained on a big corpus. Given
>>> a sentence and a LM, the perplexity tells how well that sentence fits the
>>> language (as far as I understood), and the lower the perplexity, the
>>> better the sentence fits.
>>>
>>> $NGRAMCOUNT_FILE -order 5 -interpolate -kndiscount -unk -text
>>> Wikipedia.en-es.es -lm lm/lmodel_es.lm
>>>
>>> $NGRAM_FILE -order 5 -debug 1 -unk -lm lm/lmodel_es.lm -ppl
>>> testlabeled.en-es.es  > perplexity_es_testlabeled.ppl
>>>
>>> I did the same on EN and on ES and here are some results I got:
>>>
>>> Sixty-six parent coordinators were laid off," the draft complaint says,
>>> "and
>>> not merely excessed.
>>> 1 sentences, 14 words, 0 OOVs
>>> 0 zeroprobs, logprob= -62.106 ppl= 13816.6 ppl1= 27298.9
>>>
>>> Mexico's Enrique Pena Nieto faces tough start
>>> 1 sentences, 7 words, 0 OOVs
>>> 0 zeroprobs, logprob= -39.1759 ppl= 78883.7 ppl1= 394964
>>>
>>> The NATO mission officially ended Oct. 31.
>>> 1 sentences, 7 words, 0 OOVs
>>> 0 zeroprobs, logprob= -29.2706 ppl= 4558.57 ppl1= 15188.6
>>>
>>> Sesenta y seis padres coordinadores fueron despedidos," el proyecto de
>>> denuncia, dice, "y no simplemente excessed.
>>> 1 sentences, 16 words, 0 OOVs
>>> 0 zeroprobs, logprob= -57.0322 ppl= 2263.79 ppl1= 3668.72
>>>
>>> México Enrique Peña Nieto enfrenta duras comienzo
>>> 1 sentences, 7 words, 0 OOVs
>>> 0 zeroprobs, logprob= -29.5672 ppl= 4964.71 ppl1= 16744.7
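(Side note to relate the numbers: the logprob SRILM reports is base 10, and ppl/ppl1 follow from it. Taking the first English sentence as a check:)

    ppl  = 10^( -logprob / (words + sentences - OOVs - zeroprobs) )
         = 10^( 62.106 / (14 + 1) )  which is roughly 13800
    ppl1 = 10^( -logprob / (words - OOVs - zeroprobs) )
         = 10^( 62.106 / 14 )        which is roughly 27300

so the large ppl values come directly from the low per-word log probabilities.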
>>>
>>>
>>> Why are the perplexities for the EN sentences so big? The smallest ppl I
>>> get for an EN sentence is about 250. The Spanish sentences have some
>>> errors, so I was expecting big ppl numbers. Should I change something in
>>> the way I compute the LMs?
>>>
>>> Thank you very much!!
>>>

