<div dir="ltr">Hello David and Andreas,<div><br></div><div>Sorry for replying so late. Thank you very much for your suggestions. Indeed, I had forgotten to preprocess the test set. I got better results after preprocessing, so thanks a lot for pointing it out!</div></div><div class="gmail_extra"><br><div class="gmail_quote">2017-01-24 19:06 GMT+01:00 Andreas Stolcke <span dir="ltr">&lt;<a href="mailto:stolcke@icsi.berkeley.edu" target="_blank">stolcke@icsi.berkeley.edu</a>&gt;</span>:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">Make sure text normalization is consistent between training and test data (e.g., capitalization; consider mapping to lower case, and the encoding of diacritics).<br>

<br>
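The normalization advice above can be sketched in Python. This is only an illustration, not part of SRILM: the helper names `normalize_line` and `normalize_file` are made up here, and whether lower-casing is appropriate depends on the task. The point is that the same function is applied to the training text (before ngram-count) and to the test text (before ngram -ppl).

```python
import unicodedata


def normalize_line(line: str) -> str:
    """Lower-case a line and put diacritics into composed (NFC) form.

    Hypothetical helper: NFC merges "e" + combining acute accent into the
    single code point "é", so both encodings of a word map to one
    vocabulary entry.
    """
    return unicodedata.normalize("NFC", line.lower())


def normalize_file(src_path: str, dst_path: str) -> None:
    """Write a normalized copy of src_path to dst_path (UTF-8 assumed)."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            dst.write(normalize_line(line))
```

Running `normalize_file` once on the training corpus and once on the test file, with the same settings, keeps the two sides consistent.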

Also, you're using -unk, i.e., your model contains an unknown-word token, which means OOVs get assigned a non-zero but possibly very low probability. This could mask a big divergence in the vocabulary, and the high perplexity could be the result of lots of OOV words that all get a low probability via &lt;unk&gt;. Try training without -unk and observe the tally of OOVs in the ppl output.<br>
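As a rough cross-check of vocabulary divergence, the OOV tally can also be estimated outside SRILM. The sketch below assumes whitespace tokenization and treats every word seen in the training text as in-vocabulary; the function name `oov_tally` is hypothetical, and the OOV count in the ppl output remains the authoritative figure.

```python
from collections import Counter


def oov_tally(train_lines, test_lines):
    """Count test tokens that never occur in the training text.

    Returns (Counter of OOV tokens, OOV rate over all test tokens).
    Tokens are split on whitespace; no other preprocessing is applied.
    """
    vocab = {tok for line in train_lines for tok in line.split()}
    test_tokens = [tok for line in test_lines for tok in line.split()]
    oov = Counter(tok for tok in test_tokens if tok not in vocab)
    rate = sum(oov.values()) / len(test_tokens) if test_tokens else 0.0
    return oov, rate
```

A high rate here, or a list of OOVs dominated by differently cased or differently encoded forms of known words, would point at a normalization mismatch rather than at the model itself.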

<br>

Andreas<br>

<br>

On 1/24/2017 4:57 AM, Dávid Nemeskey wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hi,<br>

<br>

it is hard to tell without knowing e.g. the training set. But I would<br>

try running ngram with higher values for -debug. I think even -debug 2<br>

tells you the logprob of the individual words. That could be a start.<br>

I actually added another debug level (100), where I print the 5 most<br>

likely candidates (requires a "forward trie" in addition to the<br>

default "backwards" one to be of usable speed) to get a sense of the<br>

proportions and how the model and the text differ.<br>

<br>

Also, just wondering. Is the training corpus bilingual (en-es)?<br>

<br>

Best,<br>

Dávid Nemeskey<br>

<br>

On Tue, Jan 24, 2017 at 1:14 PM, Stefy D. &lt;<a href="mailto:tsuki_stefy@yahoo.com" target="_blank">tsuki_stefy@yahoo.com</a>&gt; wrote:<br>

<blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex">

Hello. I have a question regarding perplexity. I am using SRILM to compute<br>

the perplexity of some sentences using an LM trained on a big corpus. Given a<br>

sentence and an LM, the perplexity tells how well that sentence fits the<br>

language (as far as I understand). The lower the perplexity, the better<br>

the sentence fits.<br>

<br>

$NGRAMCOUNT_FILE -order 5 -interpolate -kndiscount -unk -text<br>

Wikipedia.en-es.es -lm lm/lmodel_es.lm<br>

<br>

$NGRAM_FILE -order 5 -debug 1 -unk -lm lm/lmodel_es.lm -ppl<br>

testlabeled.en-es.es &gt; perplexity_es_testlabeled.ppl<br>

<br>

I did the same on EN and on ES and here are some results I got:<span class=""><br>

<br>

Sixty-six parent coordinators were laid off," the draft complaint says, "and<br>

not merely excessed.<br></span><span class="">

1 sentences, 14 words, 0 OOVs<br>

0 zeroprobs, logprob= -62.106 ppl= 13816.6 ppl1= 27298.9<br>

<br>

Mexico's Enrique Pena Nieto faces tough start<br></span><span class="">

1 sentences, 7 words, 0 OOVs<br>

0 zeroprobs, logprob= -39.1759 ppl= 78883.7 ppl1= 394964<br>

<br>

The NATO mission officially ended Oct. 31.<br></span><span class="">

1 sentences, 7 words, 0 OOVs<br>

0 zeroprobs, logprob= -29.2706 ppl= 4558.57 ppl1= 15188.6<br>

<br></span>

Sesenta y seis padres coordinadores fueron despedidos," el proyecto de<br>

denuncia, dice, "y no simplemente excessed.<br>

1 sentences, 16 words, 0 OOVs<br>

0 zeroprobs, logprob= -57.0322 ppl= 2263.79 ppl1= 3668.72<br>

<br>

México Enrique Peña Nieto enfrenta duras comienzo<span class=""><br>

1 sentences, 7 words, 0 OOVs<br></span>

0 zeroprobs, logprob= -29.5672 ppl= 4964.71 ppl1= 16744.7<br>
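The ppl and ppl1 figures above can be reproduced from the logprob line, which helps when judging how much of a high perplexity is simply due to a short sentence. The sketch below follows the formulas given in the SRILM FAQ (logprob is base 10; ppl counts one end-of-sentence token per sentence, ppl1 does not); the function name `srilm_perplexities` is made up for this illustration.

```python
def srilm_perplexities(logprob, words, sentences, oovs=0, zeroprobs=0):
    """Recompute ppl and ppl1 from the counts in SRILM's -ppl output.

    ppl  = 10 ** (-logprob / (words - oovs - zeroprobs + sentences))
    ppl1 = 10 ** (-logprob / (words - oovs - zeroprobs))
    """
    scored = words - oovs - zeroprobs  # tokens that actually received a score
    ppl = 10 ** (-logprob / (scored + sentences))
    ppl1 = 10 ** (-logprob / scored)
    return ppl, ppl1


# First English sentence above: 14 words, 1 sentence, logprob -62.106
ppl, ppl1 = srilm_perplexities(-62.106, words=14, sentences=1)
print(round(ppl, 1), round(ppl1, 1))
```

For that sentence the result agrees with the reported ppl= 13816.6 and ppl1= 27298.9 up to rounding of the logprob, i.e. an average log10 probability of about -4.14 per token.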

<br>

<br>

Why are the perplexities for the EN sentences so big? The smallest ppl I get<br>

for an EN sentence is about 250. The Spanish sentences have some errors, so<br>

I was expecting big ppl numbers. Should I change something in the way I<br>

compute the LMs?<br>

<br>

Thank you very much!!<br>

<br>

<br>

<br>

______________________________<wbr>_________________<br>

SRILM-User site list<br>

<a href="mailto:SRILM-User@speech.sri.com" target="_blank">SRILM-User@speech.sri.com</a><br>

<a href="http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user" rel="noreferrer" target="_blank">http://mailman.speech.sri.com/<wbr>cgi-bin/mailman/listinfo/srilm<wbr>-user</a><br>

</blockquote>


<br>

</blockquote>

<br>

<br>

</blockquote></div><br></div>