[SRILM User List] make-big-lm produces different LM than ngram-count

Andreas Stolcke stolcke at speech.sri.com
Tue Sep 7 10:07:37 PDT 2010


Christian A. Mandery wrote:
> Hello,
>
> I am trying to use the make-big-lm script to build modified
> Kneser-Ney LMs in a way that scales better to larger corpora.
>
> However, make-big-lm produces different LMs for me than ngram-count
> although I am using the same parameters.
>
> Not only do the probabilities and back-off values differ; the LM built
> with ngram-count also contains more {2,3,4}-grams than the LM built
> with make-big-lm.
>
>
> I invoke ngram-count with these parameters:
> ngram-count -order 4 -debug 4 -unk -map-unk "<UNK>" -vocab vocab-lm
> -gt1min 1 -gt2min 2 -gt3min 2 -gt4min 2 -kndiscount1 -kndiscount2
> -kndiscount3 -kndiscount4 -text corpus.gz -lm ngram-count.lm
>
> And make-big-lm:
> make-big-lm -read counts -name zzz-make-big-lm -order 4 -debug 4 -unk
> -map-unk "<UNK>" -vocab vocab-lm -gt1min 1 -gt2min 2 -gt3min 2 -gt4min
> 2 -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -lm
> make-big-lm.lm
>
>
> Why do these two invocations produce different LMs?
>   
As explained in the FAQ, make-big-lm computes the discounting 
parameters from the training corpus's full vocabulary, whereas 
ngram-count invoked directly performs the mapping of OOVs to <UNK> 
first and THEN computes the discounting parameters. The OOV mapping 
collapses distinct N-grams into the same <UNK>-containing N-grams, 
which changes the counts and counts-of-counts, and therefore both the 
estimated discounts and which N-grams survive the -gtNmin cutoffs; 
that is why the two LMs differ in probabilities and in size. The 
first method (estimating the discounts before any OOV mapping) is 
usually better.
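
If you want to reproduce the make-big-lm behavior with plain 
ngram-count, one way is a two-pass sketch using the -knN parameter 
files (the file names here are made up, and this assumes the kn files 
are written on the first pass, when no -lm is given, and read back on 
the second):

# Pass 1: estimate the KN discounts from the raw counts over the
# full training vocabulary (no -vocab/-unk yet), saving them:
ngram-count -order 4 -read counts \
    -gt1min 1 -gt2min 2 -gt3min 2 -gt4min 2 \
    -kndiscount1 -kn1 kn1.params -kndiscount2 -kn2 kn2.params \
    -kndiscount3 -kn3 kn3.params -kndiscount4 -kn4 kn4.params

# Pass 2: build the LM with the OOV mapping applied, reusing the
# discounts saved in pass 1:
ngram-count -order 4 -read counts -unk -map-unk "<UNK>" -vocab vocab-lm \
    -gt1min 1 -gt2min 2 -gt3min 2 -gt4min 2 \
    -kndiscount1 -kn1 kn1.params -kndiscount2 -kn2 kn2.params \
    -kndiscount3 -kn3 kn3.params -kndiscount4 -kn4 kn4.params \
    -lm two-pass.lm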

Andreas

>
> Best regards
> Christian Mandery
>
>
> PS: counts-new.gz is built using "ngram-count -text corpus.gz -write
> counts -order 4 -sort", so nothing should go wrong there.


