[SRILM User List] perplexity results

Dávid Nemeskey nemeskeyd at gmail.com
Tue Jan 24 07:27:56 PST 2017


If you have a look at the tag in the first square brackets of each line
(e.g. [1gram]), you can see that very few words come from 2-grams or
higher. What this means is that the model could almost never find the
context in the training data and had to back off to the unigram
distribution most of the time, so what you see here is essentially the
performance of an -order 1 model -- but the numbers seem quite high
even for that... Are you sure the commands you issued were the ones in
your mail?
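
(For reference, this is how those totals come about: ngram computes
ppl = 10^(-logprob / (words - OOVs - zeroprobs + sentences)), and ppl1
the same way but without the sentence-end tokens. For your first
sentence that gives 10^(62.106 / 15) ~ 13817 and 10^(62.106 / 14) ~
27299, which matches the output.)

If you want to quantify the backoff behaviour over the whole test set,
a quick sketch along these lines would do -- nothing SRILM-specific,
it just tallies the [Ngram] tags in the -debug 2 output:

    import re, sys
    from collections import Counter

    # Count how many word probabilities came from each n-gram order.
    hits = Counter()
    for line in sys.stdin:
        m = re.search(r'\[(\d)gram\]', line)
        if m:
            hits[int(m.group(1))] += 1

    total = sum(hits.values())
    for order in sorted(hits):
        print('%d-gram hits: %d (%.1f%%)'
              % (order, hits[order], 100.0 * hits[order] / total))

and then something like

    ngram -lm model.lm -ppl test.txt -debug 2 | python tally_orders.py

(model.lm, test.txt and the script name are placeholders, of course).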

If yes, it would be interesting to see some statistics of the corpus
you used. How big is the vocabulary? What do the unigram frequencies
look like? Is it possible that the distribution has a very long tail,
i.e. almost all word types occur only once or twice?
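
A quick way to check, assuming the training data is whitespace-tokenized
with one sentence per line (just a sketch, nothing SRILM-specific):

    import sys
    from collections import Counter

    # Unigram counts over the training corpus read from stdin.
    counts = Counter()
    for line in sys.stdin:
        counts.update(line.split())

    types, tokens = len(counts), sum(counts.values())
    rare = sum(1 for c in counts.values() if c <= 2)
    print('types: %d  tokens: %d' % (types, tokens))
    print('types seen at most twice: %d (%.1f%% of the vocabulary)'
          % (rare, 100.0 * rare / types))

If that last percentage is very high, the long tail is your problem.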

I would also do some preprocessing on the data, such as lowercasing
everything and running a tokenizer on it to split e.g. '"and' into the
two tokens '"' and 'and'.

On Tue, Jan 24, 2017 at 2:46 PM, Stef M <mstefd22 at gmail.com> wrote:
> Hello David.
>
> Thank you very much for answering. I am not sure if you received my reply,
> as the Yahoo servers are having problems right now, so I switched to Gmail
> (sorry if you already received the email).
>
>
> I used the Wikipedia parallel corpus (en-es) for training the two LMs
> (http://opus.lingfil.uu.se/Wikipedia.php, 1.8M sentence pairs). I used
> -debug 2 as you said, and the results are below. Could you please help me
> understand why the perplexity numbers are so high for the EN sentences,
> since they are well formed? For testing Spanish I used machine-translated
> output, so I was expecting big ppl numbers. Thank you!
>
>
> Sixty-six parent coordinators were laid off," the draft complaint says, "and
> not merely excessed.
> p( Sixty-six | <s> )  = [1gram] 2.16995e-09 [ -8.66355 ]
> p( parent | Sixty-six ...)  = [1gram] 1.0949e-05 [ -4.96063 ]
> p( coordinators | parent ...)  = [1gram] 3.37871e-07 [ -6.47125 ]
> p( were | coordinators ...)  = [1gram] 0.00120231 [ -2.91998 ]
> p( laid | were ...)  = [2gram] 0.000696035 [ -3.15737 ]
> p( off," | laid ...)  = [1gram] 2.33407e-08 [ -7.63189 ]
> p( the | off," ...)  = [2gram] 0.0469306 [ -1.32854 ]
> p( draft | the ...)  = [2gram] 7.67904e-05 [ -4.11469 ]
> p( complaint | draft ...)  = [1gram] 8.13141e-07 [ -6.08983 ]
> p( says, | complaint ...)  = [1gram] 1.17395e-05 [ -4.93035 ]
> p( "and | says, ...)  = [2gram] 0.00147669 [ -2.83071 ]
> p( not | "and ...)  = [1gram] 0.000275198 [ -3.56035 ]
> p( merely | not ...)  = [2gram] 0.00173666 [ -2.76029 ]
> p( <unk> | merely ...)  = [1gram] 0.0796503 [ -1.09881 ]
> p( </s> | <unk> ...)  = [1gram] 0.0258359 [ -1.58778 ]
> 1 sentences, 14 words, 0 OOVs
> 0 zeroprobs, logprob= -62.106 ppl= 13816.6 ppl1= 27298.9
>
>
> Mexico's Enrique Pena Nieto faces tough start
> p( Mexico's | <s> )  = [2gram] 1.31547e-06 [ -5.88092 ]
> p( Enrique | Mexico's ...)  = [1gram] 1.34348e-05 [ -4.87177 ]
> p( Pena | Enrique ...)  = [1gram] 1.83116e-06 [ -5.73727 ]
> p( Nieto | Pena ...)  = [1gram] 1.6622e-06 [ -5.77932 ]
> p( faces | Nieto ...)  = [1gram] 1.61354e-05 [ -4.79222 ]
> p( tough | faces ...)  = [1gram] 2.80928e-06 [ -5.5514 ]
> p( start | tough ...)  = [1gram] 2.90611e-05 [ -4.53669 ]
> p( </s> | start ...)  = [1gram] 0.00941231 [ -2.0263 ]
> 1 sentences, 7 words, 0 OOVs
> 0 zeroprobs, logprob= -39.1759 ppl= 78883.7 ppl1= 394964
>
>
>
> The NATO mission officially ended Oct. 31.
> p( The | <s> )  = [2gram] 0.143584 [ -0.842893 ]
> p( NATO | The ...)  = [3gram] 5.55208e-06 [ -5.25554 ]
> p( mission | NATO ...)  = [1gram] 3.10877e-05 [ -4.50741 ]
> p( officially | mission ...)  = [1gram] 2.81221e-05 [ -4.55095 ]
> p( ended | officially ...)  = [2gram] 0.00976927 [ -2.01014 ]
> p( Oct. | ended ...)  = [1gram] 2.4073e-07 [ -6.61847 ]
> p( 31. | Oct. ...)  = [1gram] 3.60453e-06 [ -5.44315 ]
> p( </s> | 31. ...)  = [2gram] 0.907671 [ -0.0420717 ]
> 1 sentences, 7 words, 0 OOVs
> 0 zeroprobs, logprob= -29.2706 ppl= 4558.57 ppl1= 15188.6
