Perplexity calculation: Strange behavior

Stefan Hahn hahn at i6.informatik.rwth-aachen.de
Thu Sep 1 02:52:29 PDT 2005


Hi again!

Your guess was exactly right: I simply forgot to specify the -order
option for the perplexity calculation.
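
For the archive, roughly what the corrected evaluation loop looks like. This
is only a sketch, not the script linked below: the file names (crusoe.txt,
lm1.lm ... lm6.lm) and the helper ppl_cmd are placeholders made up for
illustration. The only point is that -order is passed to ngram with the
same value as the order of the LM being evaluated.

```shell
#!/bin/sh
# Placeholder file names (assumptions, not the ones from the original script).
TEXT=crusoe.txt

# Print the ngram call that evaluates an LM of a given order.
# $1 = order, $2 = LM file, $3 = test text
# Without -order, ngram uses at most trigram probabilities.
ppl_cmd() {
    echo "ngram -order $1 -lm $2 -ppl $3"
}

for n in 1 2 3 4 5 6; do
    cmd=$(ppl_cmd "$n" "lm$n.lm" "$TEXT")
    echo "$cmd"
    # Run the command only if SRILM is installed and the files exist.
    if command -v ngram >/dev/null 2>&1 && [ -f "lm$n.lm" ] && [ -f "$TEXT" ]; then
        $cmd
    fi
done
```

With the matching -order given, the results for orders 4 to 6 should no
longer be identical to each other.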

Thanks again,
 Stefan



> In message <200508312031.45859.hahn at i6.informatik.rwth-aachen.de> you wrote:
> > Hi!
> >
> > During some language modeling using the SRI Toolkit (V.1.4.3 and V.1.4.5)
> > on i686 Intel GNU/Linux I encountered some strange behavior concerning
> > perplexity calculation:
> > For any order greater than 3, the perplexity calculated with ngram seems
> > to be fixed and wrong.
> > For example, I used Defoe's "Robinson Crusoe" to create modified
> > Kneser-Ney discounted language models for orders 1 up to 6 and calculated
> > the perplexity for the same text using "ngram" and our own software:
> >
> >         +------------------------+
> >         |       perplexity       |
> > +-------+-------------+----------+
> > | order | SRI-Toolkit | our Tool |
> > +-------+-------------+----------+
> > |   1   |   394.79    | 394.794  |
> > +-------+-------------+----------+
> > |   2   |   68.0706   | 68.071   |
> > +-------+-------------+----------+
> > |   3   |   54.29     | 54.2903  |
> > +-------+-------------+----------+
> > |   4   |   57.1554   | 52.6306  |
> > +-------+-------------+----------+
> > |   5   |   57.1554   | 52.6502  |
> > +-------+-------------+----------+
> > |   6   |   57.1554   | 52.7033  |
> > +-------+-------------+----------+
>
> I haven't looked at your script, but my guess is that you didn't specify
> the -order option when evaluating the LM.  The default is to use only
> up to trigram probabilities regardless of what is in the LM file.
> (That's for historical reasons.)  So of course you get the same result for
> any LM order >= 4.  Also, because of KN, you are getting a degradation
> relative to the trigram, as the lower-order probabilities are estimated
> to complement the higher-order estimates, not to be used on their own.
>
> If this is not the case then we may have a bug, but I can assure you that
> we use order >= 4 all the time.
>
> --Andreas
>
> > The script I used to download "Robinson Crusoe", create the LMs and
> > produce the SRI results:
> >
> > wget "http://www-i6.informatik.rwth-aachen.de/~gollan/make-lm-01.sh"
> > chmod a+x make-lm-01.sh
> > ./make-lm-01.sh
> >
> > Is there any error in my script?
> > Thanks,
> >  Stefan


