Perplexity calculation: Strange behavior

Stefan Hahn hahn at i6.informatik.rwth-aachen.de
Wed Aug 31 11:31:45 PDT 2005


Hi!

While building language models with the SRI Toolkit (versions 1.4.3 and 1.4.5)
on i686 Intel GNU/Linux, I encountered some strange behavior in the perplexity
calculation:
For any order greater than 3, the perplexity reported by ngram stays at one
fixed value and differs from the result of our own software.
For example, I used Defoe's "Robinson Crusoe" to create language models with
modified Kneser-Ney discounting for orders 1 through 6 and calculated the
perplexity on the same text using "ngram" and our own software:

        +------------------------+
        |       perplexity       |
+-------+-------------+----------+
| order | SRI-Toolkit | our Tool |
+-------+-------------+----------+
|   1   |   394.79    | 394.794  |
+-------+-------------+----------+
|   2   |   68.0706   | 68.071   |
+-------+-------------+----------+
|   3   |   54.29     | 54.2903  |
+-------+-------------+----------+
|   4   |   57.1554   | 52.6306  |
+-------+-------------+----------+
|   5   |   57.1554   | 52.6502  |
+-------+-------------+----------+
|   6   |   57.1554   | 52.7033  |
+-------+-------------+----------+
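
For anyone comparing the two columns, here is a minimal sketch of how a
perplexity figure can be derived from a total log10 probability. It assumes
the normalization that "ngram -ppl" is documented to use (words minus OOVs
plus sentence ends); our own tool may normalize differently. The function
name ppl and the sample numbers are made up for illustration and are not
taken from the script.

```shell
# ppl LOGPROB WORDS OOVS SENTENCES
# LOGPROB is the total log10 probability of the test text.
# Hypothetical helper, not part of SRILM.
ppl() {
    awk -v lp="$1" -v w="$2" -v oov="$3" -v s="$4" 'BEGIN {
        n = w - oov + s    # scored tokens: words minus OOVs plus sentence ends
        if (n <= 0) { print "no scored tokens" > "/dev/stderr"; exit 1 }
        printf "%.4f\n", exp(-lp / n * log(10))    # 10^(-logprob / n)
    }'
}

# Toy input: logprob -3 over 2 words and 1 sentence end
ppl -3 2 0 1
```

With the toy input above, 3 tokens are scored at an average log10
probability of -1 each, so the call prints 10.0000.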

The script I used to download "Robinson Crusoe", create the LMs, and produce
the SRI results can be fetched and run as follows:

wget "http://www-i6.informatik.rwth-aachen.de/~gollan/make-lm-01.sh"
chmod a+x make-lm-01.sh
./make-lm-01.sh

Is there any error in my script?
Thanks,
 Stefan


