[SRILM User List] compute perplexity

Andreas Stolcke stolcke at icsi.berkeley.edu
Thu Mar 27 10:25:35 PDT 2014

On 3/19/2014 10:57 AM, Stefy D. wrote:
> Dear Andreas,
> thank you very much for replying.
> I trained both LMs using the "-unk" option like this:
> $NGRAMCOUNT_FILE -order 3 -interpolate -kndiscount -unk -text 
> $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -lm 
> $WORKING_DIR"lm_a/lmodel.lm"

That explains why you are not getting OOVs reported in the ppl output.  
Unknown words are mapped to <unk> and thus the LM has a probability for 
them.

> For the OOV rate I created a vocabulary list for the training data and 
> I used the unigram counts of the test set and the compute-oov-rate 
> script like this:
> $NGRAMCOUNT_FILE -order 1 -write-vocab "vocabularyTargetUnigram.txt" 
> -text $WORKING_DIR$OUT_CORPUS"lowercased."$TARGET -sort
> $NGRAMCOUNT_FILE -order 1  -text $WORKING_DIR"test.lowercased."$TARGET 
> -write "unigramCounts_testdata.txt" -sort
> $OOVCOUNT_FILE vocabularyTargetUnigram.txt unigramCounts_testdata.txt
> This is how I got that OOV rate mentioned in the first mail. Could you 
> please let me know if I used the right commands to compute that?
You did it right.
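
For readers who want to see what that OOV computation amounts to, here is a 
minimal Python sketch. It is not the code of compute-oov-rate; the function 
name oov_rate and the toy data are made up for illustration, and it does not 
model any special handling of tags such as <s> and </s>.

```python
from collections import Counter

def oov_rate(train_vocab, test_counts):
    """Return (token OOV rate, type OOV rate) of the test counts
    measured against the training vocabulary."""
    total_tokens = sum(test_counts.values())
    # Tokens: every occurrence of a word missing from the vocabulary.
    oov_tokens = sum(c for w, c in test_counts.items() if w not in train_vocab)
    # Types: every distinct word missing from the vocabulary.
    oov_types = sum(1 for w in test_counts if w not in train_vocab)
    return oov_tokens / total_tokens, oov_types / len(test_counts)

# Hypothetical toy data, standing in for the vocabulary file and the
# unigram count file produced by the ngram-count commands above.
train_vocab = {"the", "cat", "sat"}
test_counts = Counter("the dog sat on the mat".split())
token_rate, type_rate = oov_rate(train_vocab, test_counts)
print(f"token OOV rate {token_rate:.2f}, type OOV rate {type_rate:.2f}")
```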

> You said I should train LM_A using the vocabulary of corpus A + corpus 
> X so that the perplexities can be compared. So I should train LM_A 
> using only corpus A but the vocabulary of A + X? I am sorry to be 
> confused, but I thought that for estimating the LM the vocabulary 
> should be from the same corpus used for estimating. I am using these 
> LMs in SMT systems (a baseline and an adapted one). If I influence the 
> baseline LM with vocabulary from the adapted data, then the baseline 
> is not really a baseline. Please tell me if I am thinking incorrectly.
You are right.   What this illustrates is that perplexity alone is not a 
sufficient metric for comparing LMs.  In your scenario (LM adaptation) 
the expansion of the vocabulary is a key component of the adaptation 
process, but LMs with different vocabularies are no longer comparable by 
ppl.  My suggestion to unify the vocabularies was a workaround to allow 
you to still use perplexity comparison.
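
To make the comparability problem concrete, here is a small sketch of how the 
ppl figure is normalized, assuming the formula given in the SRILM 
documentation: 10 to the power of minus the log10 probability divided by 
(words - OOVs - zeroprobs + sentences). The function name and all numbers 
below are hypothetical; they only show that a model which skips more words 
is scored over fewer tokens.

```python
def perplexity(logprob10, words, sentences, oovs=0, zeroprobs=0):
    """Perplexity from a total log10 probability, normalized over the
    tokens that were actually scored (assumed SRILM-style normalization)."""
    scored = words - oovs - zeroprobs + sentences
    return 10 ** (-logprob10 / scored)

# Hypothetical numbers for one test set of 90 words in 10 sentences.
# Model 1 scores every word (unknown words go to <unk>).
ppl_full = perplexity(-200.0, words=90, sentences=10)
# Model 2 treats 10 words as OOV and leaves them out of the total.
ppl_skip = perplexity(-160.0, words=90, sentences=10, oovs=10)
print(ppl_full, ppl_skip)
```

The second figure is lower only because the hardest words were left out, 
which is why the two numbers cannot be compared directly.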

> Thank you for introducing me to statistical significance.
> To generate a table of word level probabilities on the same test set 
> should I use get-unigram-probs? But where do I specify the test set?
> $UNIGRAMPROBS_FILE linear=1 $WORKING_DIR"lm_a/lmodel.arpa."$TARGET > 
> table_A.out
No, you get the word probabilities from output of ngram -debug 2 -ppl 
(you need to write some perl or whatever script to extract the 
probabilities).

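
A minimal sketch of such an extraction script, in Python rather than perl. 
It assumes that each word line in the -debug 2 output looks like 
"p( word | context ) = [2gram] 0.0123 [ -1.91 ]"; that line shape, the regular 
expression, and the sample lines are assumptions to check against your own 
output file.

```python
import re

# Assumed shape of a word line in "ngram -debug 2 -ppl" output:
#   <tab>p( word | context ) <tab>= [2gram] 0.0123 [ -1.91009 ]
WORD_LINE = re.compile(
    r"^\s*p\(\s*(\S+)\s*\|.*\)\s*=\s*\[(\S+)\]\s*(\S+)\s*\[\s*(\S+)\s*\]"
)

def word_probs(lines):
    """Return one (word, ngram_type, prob, log10prob) tuple per word line;
    sentence text and summary lines are skipped."""
    rows = []
    for line in lines:
        m = WORD_LINE.match(line)
        if m:
            word, ngram_type, prob, logprob = m.groups()
            rows.append((word, ngram_type, float(prob), float(logprob)))
    return rows

# Hypothetical sample lines, not taken from a real run.
sample = [
    "the cat sat",
    "\tp( the | <s> ) \t= [2gram] 0.0123 [ -1.91009 ]",
    "\tp( cat | the ...) \t= [3gram] 0.001 [ -3 ]",
    "\tp( </s> | sat ...) \t= [2gram] 0.1 [ -1 ]",
    "1 sentences, 3 words, 0 OOVs",
]
rows = word_probs(sample)
for row in rows:
    print(*row, sep="\t")
```

On a real file, something like word_probs(open(path)) would do the same, with 
path pointing at the saved ngram output.
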
> To get how many words had lower/same/greater probability in LM_B is 
> using compare-ppls script ok? For example, I get this output when 
> applying it to my 2 LMs (ngram -debug 2 on the same test set as in 
> previous commands):
> $COMPARE_PPLS $WORKING_DIR"ppl_files/ppl_A_detail.ppl" 
> $WORKING_DIR"ppl_files/ppl_B_detail.ppl"
> output: total 22450, equal 0, different 22450, greater 11447
Yes, it seems compare-ppls extracts exactly the statistics I was talking 
about.  I had forgotten about it ...
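
As an illustration of what can be done with those counts, here is a sketch of 
a sign test on the compare-ppls figures quoted above. The sign test is one 
common choice and not necessarily the only suitable one; the function names 
are made up, the normal approximation is used for simplicity, and the test 
treats words as independent, which is optimistic for words within the same 
sentence.

```python
import math

def compare(logprobs_a, logprobs_b):
    """Count word positions where model B's log probability is equal to,
    different from, or greater than model A's."""
    if len(logprobs_a) != len(logprobs_b):
        raise ValueError("both models must be scored on the same test set")
    pairs = list(zip(logprobs_a, logprobs_b))
    equal = sum(1 for a, b in pairs if a == b)
    greater = sum(1 for a, b in pairs if b > a)
    return {"total": len(pairs), "equal": equal,
            "different": len(pairs) - equal, "greater": greater}

def sign_test(greater, different):
    """Two-sided sign test via the normal approximation to a
    Binomial(different, 0.5) distribution; returns (z, p_value)."""
    mean = different / 2.0
    sd = math.sqrt(different) / 2.0
    z = (greater - mean) / sd
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Counts taken from the compare-ppls output quoted above.
z, p_value = sign_test(greater=11447, different=22450)
print(f"z = {z:.2f}, p = {p_value:.4f}")
```

With these counts z comes out close to 3 and the p-value is below 0.01, 
subject to the independence caveat above.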

