Comparison between SRILM and CMU

Andreas Stolcke stolcke at
Tue Sep 3 09:25:11 PDT 2002

In message <B0793DB946E52942A49C1E8152A1358CE68BC1 at> you wrote:
> Hi all,
> I have a general question about the toolkit. I have just started using the
> SRILM toolkit; before, I always used the CMU toolkit, so I wanted to do a
> comparison between the language models created with each one. I built
> language models from the same corpus with both toolkits and computed the
> perplexities with each toolkit (that is, I used the same toolkit for both
> creating the model and evaluating its perplexity), and the perplexities
> were quite different, always better for SRILM. I then tried computing the
> perplexities of the CMU language models with the SRILM toolkit and got
> strange results: most of the time the same CMU language model performed
> better when its perplexity was computed with SRILM instead of CMU, except
> for one case where the value that SRILM gave was extremely high. After
> this, I did it the other way around and used CMU to evaluate the SRILM
> language models. After some trouble with the format and some special
> requirements of the CMU toolkit, I got worse results when using the CMU
> toolkit to evaluate the perplexity of the SRILM language models (and when
> the text used for evaluating perplexity contained OOV words, CMU gave an
> error). My question is: what is the difference in the computation of
> perplexity between the two toolkits? And also, what is the meaning of the
> "ppl1" that the SRILM toolkit gives?


I think what you are doing is an excellent idea, and I'm sure people here
would like to see the results once you've figured out the bugs.

Regarding your last question:  ppl1 is the perplexity excluding 
end-of-sentence tokens.  That is, the total log likelihood is normalized
by the number of words alone, rather than by (number of words + number of
</s> tags) as in the regular perplexity.  This is a little more meaningful
(though not perfect) when comparing perplexities on test sets that follow
different rules for sentence segmentation.
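To make the two normalizations concrete, here is a small Python sketch; the per-token log probabilities are invented for illustration, and base-10 logs are assumed (as SRILM reports):

```python
# Hypothetical per-token log10 probabilities for a two-sentence test set.
# All values below are made up purely to illustrate the arithmetic.
word_logprobs = [-1.2, -0.8, -2.1, -1.5, -0.9]  # regular words
eos_logprobs = [-0.5, -0.7]                     # one </s> per sentence

total = sum(word_logprobs) + sum(eos_logprobs)

# ppl normalizes by words plus end-of-sentence tags; ppl1 by words only.
ppl = 10 ** (-total / (len(word_logprobs) + len(eos_logprobs)))
ppl1 = 10 ** (-total / len(word_logprobs))
```

Since the denominator for ppl1 is smaller while the total log likelihood is the same, ppl1 comes out higher than ppl on the same test set.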

About the discrepancies between CMU and SRI toolkits:  I think the only
way to resolve this is to dump out the word-level probabilities and 
compare them one-by-one.  This should allow you to tell how the two
differ in their perplexity computation.  In SRILM, you can use
ngram -debug 2 -ppl for this. My suspicion is that it has something to do
with the way OOV words are handled. 
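Once both dumps are in hand, the comparison itself is mechanical. A minimal Python sketch, assuming you have already parsed each toolkit's verbose output into (word, log10 probability) pairs; the parsing step is toolkit-specific and omitted, and the probability values below are invented:

```python
# Hypothetical per-word log10 probabilities extracted from each toolkit's
# verbose output (e.g. SRILM's "ngram -debug 2 -ppl" dump). All values
# here are made up for illustration.
srilm = [("the", -1.1), ("cat", -2.3), ("sat", -1.9)]
cmu = [("the", -1.1), ("cat", -2.3), ("sat", -3.4)]

# The token sequences must line up before probabilities can be compared.
assert [w for w, _ in srilm] == [w for w, _ in cmu], "tokenizations differ"

# Collect the positions where the two toolkits disagree beyond a tolerance.
mismatches = [
    (w, p1, p2)
    for (w, p1), (_, p2) in zip(srilm, cmu)
    if abs(p1 - p2) > 0.01
]
for word, p_srilm, p_cmu in mismatches:
    print(f"{word!r}: SRILM {p_srilm}, CMU {p_cmu}")
```

Mismatches concentrated on particular tokens (for example OOV words or sentence boundaries) would point directly at where the two perplexity computations diverge.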

Also, I'd be interested to know what prevented the SRILM-built LM
from working with the CMU tools.  If it's something simple we will
fix it (unless it is clearly a CMU bug).


More information about the SRILM-User mailing list