[SRILM User List] Right way to build LM

Ismail Rusli ismail.indonesia at gmail.com
Mon Apr 28 19:38:44 PDT 2014

Thanks for the answer, Andreas.

As I read in the paper by
Chen and Goodman (1999), they used held-out data
to optimize the parameters of the language model. How do I
do this in SRILM? Does SRILM optimize the parameters
when I use -kndiscount? I tried -kn to save the
parameters in a file and included this file
when building the LM, but it turned out that
my perplexity got bigger.

And just one more question:
do you have a link to a good tutorial on using
class-based models with SRILM?


On 04/29/2014 06:20 AM, Andreas Stolcke wrote:
> On 4/28/2014 3:01 AM, Ismail Rusli wrote:
>> Dear all,
>> I attempted to build an n-gram LM from Wikipedia text. I have
>> cleaned up all unwanted lines. I have approximately 36M words.
>> I split the text in 90:10 proportions. Then from the 90%,
>> I split again into 4 joint training sets of increasing
>> size (the largest is about 1M sentences).
>> The commands I used are as follows:
>> 1. Count n-grams and vocabulary:
>> ngram-count -text 1M -order 3 -write count.1M -write-vocab vocab.1M -unk
>> 2. Build LM with ModKN:
>> ngram-count -vocab vocab.1M -read count.1M -order 3 -lm kn.lm -kndiscount
> There is no need to specify -vocab if you are getting it from the same 
> training data as the counts.
> The use of -vocab is to specify a vocabulary that differs from that of 
> the training data.
> In fact you can combine 1 and 2 in one command that is equivalent:
> ngram-count -text 1M -order 3 -unk -lm kn.lm -kndiscount
> Also, if you do use two steps, be sure to include the -unk option in 
> the second step.
>> 3. Calculate perplexity:
>> ngram -ppl test -order 3 -lm kn.lm
>> My questions are:
>> 1. Did I do it right?
> It looks like you did.
>> 2. Is there any optimization I can do when building the LM?
> a. Try different -order values
> b. Different smoothing methods.
> c. Possibly class-based models (interpolated with word-based)
> d. If you want to increase training data size significantly, check the 
> methods for conserving memory on the FAQ page.
>> 3. How do I calculate perplexity with log base 2 instead of log base 10?
> Perplexity is not dependent on the base of the logarithm (the log base 
> is matched by the number you exponentiate to get the ppl).
> Andreas
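
The point about the log base can be checked numerically. The following is a minimal Python sketch, not SRILM code: the per-token probabilities are made up for illustration, and the names perplexity, probs, and log2_total are hypothetical. It computes perplexity as base ** (-total_logprob / N) for several bases, and also shows that a base-10 total log probability can be converted to base 2 by dividing by log10(2).

```python
import math


def perplexity(token_probs, base):
    """Perplexity using logarithms of the given base.

    The same base is used for the logarithm and for the final
    exponentiation, so the choice of base cancels out.
    """
    n = len(token_probs)
    total_logprob = sum(math.log(p, base) for p in token_probs)
    return base ** (-total_logprob / n)


# Hypothetical per-token probabilities, made up for illustration only.
probs = [0.1, 0.25, 0.05, 0.5]

ppl_base10 = perplexity(probs, 10)
ppl_base2 = perplexity(probs, 2)
ppl_base_e = perplexity(probs, math.e)

# Converting a total log probability from base 10 to base 2.
log10_total = sum(math.log10(p) for p in probs)
log2_total = log10_total / math.log10(2)
ppl_from_converted = 2 ** (-log2_total / len(probs))

print(ppl_base10, ppl_base2, ppl_base_e, ppl_from_converted)
```

All four printed values agree up to floating-point rounding, so only the reported log probability changes with the base, not the perplexity.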
