[SRILM User List] Right way to build LM
Ismail Rusli
ismail.indonesia at gmail.com
Mon Apr 28 03:01:14 PDT 2014
Dear all,
I attempted to build an n-gram LM from Wikipedia text. I have cleaned
up all unwanted lines and have approximately 36M words. I split the
text into 90:10 proportions. Then, from the 90% portion, I split again
into 4 nested training sets of increasing size (the largest is about
1M sentences), roughly as sketched below.
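The split looked roughly like this (just a sketch: the file names and
the three smaller set sizes are placeholders, and I assume one
sentence per line, with the sets nested so each larger set contains
the smaller ones):

# 90:10 split of the cleaned corpus (one sentence per line)
total=$(wc -l < wiki.clean.txt)
train_n=$(( total * 90 / 100 ))
head -n "$train_n" wiki.clean.txt > train.txt
tail -n +"$(( train_n + 1 ))" wiki.clean.txt > test

# nested training sets of increasing size (largest ~1M sentences)
for n in 125000 250000 500000 1000000; do
    head -n "$n" train.txt > train.$n
done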
The commands I used are the following:
1. Count n-gram and vocabulary:
ngram-count -text 1M -order 3 -write count.1M -write-vocab vocab.1M -unk
2. Build LM with ModKN:
ngram-count -vocab vocab.1M -read count.1M -order 3 -lm kn.lm -kndiscount
3. Calculate perplexity:
ngram -ppl test -order 3 -lm kn.lm
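I run the same three commands for each of the four training sets,
roughly like this (a sketch; all size labels except 1M are
placeholders):

for n in 125000 250000 500000 1000000; do
    ngram-count -text train.$n -order 3 -write count.$n -write-vocab vocab.$n -unk
    ngram-count -vocab vocab.$n -read count.$n -order 3 -lm kn.$n.lm -kndiscount
    ngram -ppl test -order 3 -lm kn.$n.lm
done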
My questions are:
1. Did I do it right?
2. Is there any optimization I can do when building the LM?
3. How can I calculate perplexity using log base 2 instead of log
base 10? (My current guess is sketched below.)
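For question 3, my current guess: perplexity itself should be
base-independent, since 10^(-logprob/N) with a log10 logprob equals
2^(-logprob/N) with a log2 logprob. So I would only convert the
reported log probability, dividing the log10 value by log10(2),
e.g. (a sketch with a made-up number):

# hypothetical conversion of a log10 probability to log2
echo "-123456.7" | awk '{ print $1 / 0.30102999566 }'

Is that the intended way, or is there an option in SRILM for this?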
Thanks in advance.
Ismail