[SRILM User List] Right way to build LM

Mon Apr 28 03:01:14 PDT 2014

Dear all,

I am trying to build an n-gram LM from Wikipedia text. I have
cleaned up all unwanted lines, which leaves approximately 36M words.
I split the text 90:10 into training and test portions. From the
90% portion, I then built 4 joint training sets of increasing
size (the largest has about 1M sentences).
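
To make the data preparation concrete, here is a minimal sketch of such a split. It is only an illustration, not necessarily the exact procedure used: it assumes one sentence per line, takes "joint" to mean that each smaller training set is a prefix of the next larger one, and uses equal size steps. The function name `split_corpus` and its parameters are hypothetical.

```python
def split_corpus(sentences, test_fraction=0.1, n_sets=4):
    """Hold out the tail of the corpus as test data, then build
    nested training sets of increasing size from the remainder."""
    n_test = int(len(sentences) * test_fraction)
    n_train = len(sentences) - n_test
    train, test = sentences[:n_train], sentences[n_train:]

    # Assumption: equal size steps; the largest set is the full training portion.
    sizes = [n_train * (i + 1) // n_sets for i in range(n_sets)]

    # Each set is a prefix of the training portion, so smaller sets
    # are contained in the larger ones.
    train_sets = [train[:size] for size in sizes]
    return train_sets, test
```

Each returned training set could then be written to its own file and passed to ngram-count via -text.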

The commands I used are as follows:

1. Count n-grams and write the vocabulary:
ngram-count -text 1M -order 3 -write count.1M -write-vocab vocab.1M -unk

2. Build the LM with modified Kneser-Ney (ModKN) discounting:
ngram-count -vocab vocab.1M -read count.1M -order 3 -lm kn.lm -kndiscount

3. Calculate perplexity:
ngram -ppl test -order 3 -lm kn.lm

My questions are:
1. Did I do it right?
2. Is there any optimization I can do when building the LM?
3. How can I calculate perplexity using log base 2 instead of log base 10?
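
A note on question 3, as a sketch rather than a definitive answer: as far as I understand the SRILM documentation, ngram -ppl reports logprob in base 10 and computes ppl as 10^(-logprob / (words - OOVs - zeroprobs + sentences)). Under that assumption, the perplexity value itself does not depend on the log base; only the logprob changes. The function names are hypothetical helpers, and the numbers in the usage lines are made up for illustration, not taken from a real run.

```python
import math


def srilm_perplexity(logprob10, words, sentences, oovs=0, zeroprobs=0):
    """Perplexity from a base-10 log probability, following the
    formula assumed above for the 'ppl' value."""
    denom = words - oovs - zeroprobs + sentences
    return 10.0 ** (-logprob10 / denom)


def to_log2(logprob10):
    """Convert a base-10 log probability to base 2."""
    return logprob10 / math.log10(2.0)


def perplexity_from_log2(logprob2, words, sentences, oovs=0, zeroprobs=0):
    """Same perplexity, computed from a base-2 log probability."""
    denom = words - oovs - zeroprobs + sentences
    return 2.0 ** (-logprob2 / denom)


# Illustrative (made-up) numbers: logprob = -2000, 900 words, 100 sentences.
ppl10 = srilm_perplexity(-2000.0, words=900, sentences=100)
ppl2 = perplexity_from_log2(to_log2(-2000.0), words=900, sentences=100)
```

Both values agree up to floating-point rounding, so a base-2 figure such as the cross-entropy in bits per token can be obtained as log2 of the reported ppl.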

Thanks in advance.

Ismail