[SRILM User List] compute perplexity

Tue Mar 18 22:02:22 PDT 2014

On 3/18/2014 12:44 PM, Stefy D. wrote:
> Dear all,
>
> I have some questions regarding perplexity...I am very thankful for 
> your time/ answers.
>
> Settings:
> - one language model LM_A estimated using training corpus A
> - one language model LM_B estimated using training corpus B (B = 
> corpus_A + corpus_X)
>
> My intention is to prove that model B is better than model A so I 
> thought I should show that the perplexity decreased (which can be seen 
> from the ppl files).
>
> Commands used to estimate ppl:
> $NGRAM_FILE -order 3  -lm $WORKING_DIR"lm_A/lmodel.lm" -ppl 
> $WORKING_DIR"test.lowercased."$TARGET >  $WORKING_DIR"ppl_A.ppl"
>
> $NGRAM_FILE -order 3  -lm $WORKING_DIR"lm_B/lmodel.lm" -ppl 
> $WORKING_DIR"test.lowercased."$TARGET >  $WORKING_DIR"ppl_B.ppl"
>
> The contents of the two ppl files are (A then B):
> 1000 sentences, 21450 words, 0 OOVs
> 0 zeroprobs, logprob= -57849.4 ppl= 377.407 ppl1= 497.67
> -------------------------------------------------------------------------------------------
> 1000 sentences, 21450 words, 0 OOVs
> 0 zeroprobs, logprob= -55535.3 ppl= 297.67 ppl1= 388.204
>
> Questions:
> 1. Why do I get 0 OOVs? I checked using the compute-oov-rate script 
> how many OOV there are in the test data compared to the training and 
> it gave me the result "OOV tokens: 393 / 21450 (1.83%) excluding 
> fragments: 390 / 21442 (1.82%)".
You didn't say how you trained the LMs.  Did you include an unknown-word 
probability?  The exact options used for LM training matter here.
>
> 2. I read on the srilm-faq that "Note that perplexity comparisons are 
> only ever meaningful if the vocabularies of all LMs are the same." 
> Since I want to compare perplexities of two LM I am wondering if I did 
> the right thing with my settings and commands used. The two LM were 
> estimated on different training corpora so the vocabularies are not 
> identical, right? Please tell me what I am doing wrong.
Again, we don't know how you trained the LMs, hence we don't know the 
vocabularies.
The best way to make the perplexities comparable would be to extract the 
vocabulary from corpus A + corpus X, and then specify that for training 
LM_A (using -vocab).
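
To illustrate why a shared vocabulary matters, here is a minimal Python sketch, not part of SRILM, that builds the union vocabulary of corpus A and corpus X and counts OOV tokens in a test set against it. The function names and the toy sentences are made up for illustration, and tokenization is assumed to be plain whitespace splitting.

```python
def build_vocab(*corpora):
    """Collect every whitespace-delimited token from the given corpora."""
    vocab = set()
    for lines in corpora:
        for line in lines:
            vocab.update(line.split())
    return vocab


def oov_tokens(vocab, test_lines):
    """Return (number of OOV tokens, total tokens) for the test data."""
    tokens = [tok for line in test_lines for tok in line.split()]
    return sum(tok not in vocab for tok in tokens), len(tokens)


# Toy data standing in for corpus A, corpus X and the test set (hypothetical).
corpus_a = ["the cat sat", "the dog ran"]
corpus_x = ["a bird flew"]
test = ["the bird sat", "a fish swam"]

shared_vocab = build_vocab(corpus_a, corpus_x)
print(sorted(shared_vocab))
print(oov_tokens(build_vocab(corpus_a), test))  # vocabulary of A alone
print(oov_tokens(shared_vocab, test))           # shared vocabulary
```

With the vocabulary of A alone, more test tokens count as OOV than with the shared vocabulary, so the two models would be scored on different sets of words. Training both LMs with the same word list removes that difference.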

>
> 3. If those two perplexities were computed correctly, then could you 
> please tell me if their difference means that the LM model has been 
> really improved and if there is a measure that says if this 
> improvement is significant?
The perplexities look quite different (377.4 versus 297.7, a reduction 
of about 21%).  Differences of 10-20% are usually considered non-negligible.
For statistical significance there are a number of tests you can apply, 
although none are built into SRILM.
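
As a rough check on the figures quoted above, the following Python sketch recomputes ppl and ppl1 from the logprob totals and reports the relative reduction. It assumes the usual definitions (base-10 logprob; ppl normalizes by words plus end-of-sentence tokens, ppl1 by words only, in both cases excluding OOVs and zeroprobs). The function name is made up for illustration.

```python
def perplexities(logprob, words, sentences, oovs=0, zeroprobs=0):
    """Recompute ppl and ppl1 from a summary line (logprob is base 10)."""
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored + sentences))  # includes end-of-sentence tokens
    ppl1 = 10 ** (-logprob / scored)               # words only
    return ppl, ppl1


# Totals copied from the two ppl files quoted in the question.
ppl_a, ppl1_a = perplexities(-57849.4, words=21450, sentences=1000)
ppl_b, ppl1_b = perplexities(-55535.3, words=21450, sentences=1000)

reduction = (ppl_a - ppl_b) / ppl_a
print(f"LM_A ppl={ppl_a:.1f}  LM_B ppl={ppl_b:.1f}  relative reduction={reduction:.1%}")
```

Small deviations from the reported values are expected because the logprob totals in the ppl files are rounded.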

The most straightforward tests would be nonparametric ones that compare 
the probabilities output by the two LMs for corresponding words or sentences.
Generate a table of word-level probabilities for LM_A and then LM_B, on 
the same test set.  Then ask, how many words had lower/same/greater 
probability in LM_B?
From those statistics you can apply either the Sign test 
<http://en.wikipedia.org/wiki/Sign_test> or the stronger Wilcoxon test 
<http://en.wikipedia.org/wiki/Wilcoxon_signed-rank_test> (for the latter 
you need the differences of the probabilities, not just their sign).
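
The counting step can be sketched in Python as follows. This is only an illustration: it assumes you have already extracted two parallel lists of per-word log probabilities (one per LM, same test set, same order), and the function name and sample numbers are hypothetical.

```python
def sign_counts(logprobs_a, logprobs_b):
    """Count items where LM_B scores lower / the same / higher than LM_A."""
    if len(logprobs_a) != len(logprobs_b):
        raise ValueError("both lists must cover the same test items")
    lower = same = higher = 0
    for a, b in zip(logprobs_a, logprobs_b):
        if b < a:
            lower += 1
        elif b > a:
            higher += 1
        else:
            same += 1
    return lower, same, higher


# Hypothetical per-word log probabilities for a five-word test set.
word_logprobs_a = [-1.2, -0.7, -3.1, -2.0, -0.9]
word_logprobs_b = [-1.0, -0.7, -2.5, -2.4, -0.8]

lower, same, higher = sign_counts(word_logprobs_a, word_logprobs_b)
n = lower + higher  # ties are dropped for the Sign test
print(f"lower={lower} same={same} higher={higher} -> n={n}, k={higher}")

# The Wilcoxon test would instead use the paired differences themselves.
differences = [b - a for a, b in zip(word_logprobs_a, word_logprobs_b)]
```

The pair (n, k) is what the Sign test needs. Comparing log probabilities gives the same signs as comparing the probabilities, since the logarithm is monotonic.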

The Sign test is extremely simple and can be computed with a small 
helper script included in SRILM.  For example if LM_B gives higher 
probability for 1080 out of 2000 words (and there are no ties), then the 
significance levels are computed by

% $SRILM/bin/cumbin 2000 1080
One-tailed: P(k >= 1080 | n=2000, p=0.5) = 0.00018750253721029
Two-tailed: 2*P(k >= 1080 | n=2000, p=0.5) = 0.00037500507442058
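
If the helper script is not at hand, the same quantity can be computed directly. The sketch below is my own Python illustration of the exact binomial tail for p=0.5, not the cumbin source, and the function name is made up.

```python
from math import comb


def binomial_upper_tail(n, k):
    """Exact P(K >= k) for K ~ Binomial(n, 0.5), summed with integer arithmetic."""
    favourable = sum(comb(n, i) for i in range(k, n + 1))
    return favourable / 2 ** n


one_tailed = binomial_upper_tail(2000, 1080)
two_tailed = min(1.0, 2 * one_tailed)
print(f"One-tailed: {one_tailed:.6g}")
print(f"Two-tailed: {two_tailed:.6g}")
```

Python integers have arbitrary precision, so the sum is exact and only the final division is rounded to floating point.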

Doing this at the word-level assumes that all the words in a sentence 
are assigned probabilities independently, which is plainly not true (the 
same word occurs in several ngrams).  So a more conservative approach 
would compare the sentence-level probabilities.
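
A sentence-level comparison might look like the Python sketch below. It assumes per-word log probabilities for each LM plus the sentence lengths in words. All names and numbers are hypothetical. If you already have per-sentence logprobs from the ppl output, the summing step is unnecessary.

```python
def sentence_totals(word_logprobs, sentence_lengths):
    """Sum consecutive word log probabilities into one total per sentence."""
    totals, start = [], 0
    for length in sentence_lengths:
        totals.append(sum(word_logprobs[start:start + length]))
        start += length
    if start != len(word_logprobs):
        raise ValueError("sentence lengths do not add up to the number of words")
    return totals


# Hypothetical test set: three sentences of 3, 2 and 4 words.
lengths = [3, 2, 4]
words_a = [-1.0, -2.0, -0.5, -1.5, -1.5, -0.2, -0.3, -2.0, -1.0]
words_b = [-0.8, -2.1, -0.4, -1.7, -1.6, -0.2, -0.2, -1.5, -1.0]

sent_a = sentence_totals(words_a, lengths)
sent_b = sentence_totals(words_b, lengths)

wins = sum(b > a for a, b in zip(sent_a, sent_b))    # LM_B better
losses = sum(b < a for a, b in zip(sent_a, sent_b))  # LM_B worse
print(f"n={wins + losses}, k={wins}")  # inputs for the Sign test, ties dropped
```

With 1000 test sentences instead of 21450 words, n is much smaller, so the same proportion of wins yields a weaker but more defensible significance level.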

Andreas
