[SRILM User List] make-big-lm produces different LM than ngram-count

Tue Sep 7 08:59:57 PDT 2010

Hello,

I am trying to use the make-big-lm script to build modified
Kneser-Ney LMs in a way that scales better to larger corpora.

However, make-big-lm produces different LMs for me than ngram-count,
although I am using the same parameters.

Not only do the probabilities and back-off weights differ, but the LM
built with ngram-count also contains more {2,3,4}-grams than the LM
built with make-big-lm.
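
The difference in n-gram totals can be read from the \data\ header at
the top of each ARPA file. Below is a minimal sketch, assuming both LMs
are uncompressed ARPA-format text files; arpa_counts is a hypothetical
helper name used for illustration and is not part of SRILM.

```shell
# Hypothetical helper (not part of SRILM): print the "ngram N=count"
# lines from the \data\ header of an uncompressed ARPA-format LM.
arpa_counts() {
    # Header lines look like "ngram 2=12345". Stop at the first
    # "\N-grams:" section marker so the n-gram entries are never scanned.
    awk '/^ngram [0-9]+=/ { print }
         /-grams:$/       { exit }' "$1"
}

# Example usage with the file names from the two commands in this post:
#   arpa_counts ngram-count.lm
#   arpa_counts make-big-lm.lm
```

Running the helper on both files and comparing the output line by line
shows which orders differ and by how many n-grams.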

I invoke ngram-count with these parameters:
ngram-count -order 4 -debug 4 -unk -map-unk "<UNK>" -vocab vocab-lm
-gt1min 1 -gt2min 2 -gt3min 2 -gt4min 2 -kndiscount1 -kndiscount2
-kndiscount3 -kndiscount4 -text corpus.gz -lm ngram-count.lm

And make-big-lm:
make-big-lm -read counts -name zzz-make-big-lm -order 4 -debug 4 -unk
-map-unk "<UNK>" -vocab vocab-lm -gt1min 1 -gt2min 2 -gt3min 2 -gt4min
2 -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -lm
make-big-lm.lm

Why do the LMs generated by these two calls differ?

Best regards
Christian Mandery

PS: counts-new.gz is built using "ngram-count -text corpus.gz -write
counts -order 4 -sort", so nothing should go wrong there.