[SRILM User List] Right way to build LM

Andreas Stolcke stolcke at icsi.berkeley.edu
Mon Apr 28 16:20:26 PDT 2014


On 4/28/2014 3:01 AM, Ismail Rusli wrote:
> Dear all,
>
> I attempted to build an n-gram LM from Wikipedia text. I have
> cleaned up all unwanted lines. I have approximately 36M words.
> I split the text into 90:10 proportions. Then from the 90%,
> I split again into 4 joint training sets of increasing
> size (the largest being about 1M sentences).
>
> The commands I used are the following:
>
> 1. Count n-gram and vocabulary:
> ngram-count -text 1M -order 3 -write count.1M -write-vocab vocab.1M -unk
>
> 2. Build LM with ModKN:
> ngram-count -vocab vocab.1M -read count.1M -order 3 -lm kn.lm -kndiscount

There is no need to specify -vocab if the vocabulary comes from the same 
training data as the counts.
The purpose of -vocab is to specify a vocabulary that differs from that 
of the training data.
In fact, you can combine steps 1 and 2 into one equivalent command:

ngram-count -text 1M -order 3  -unk -lm kn.lm -kndiscount

Also, if you do use two steps, be sure to include the -unk option in the 
second step.
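A sketch of that two-step variant (using the same file names as above) 
would be:

ngram-count -text 1M -order 3 -unk -write count.1M -write-vocab vocab.1M
ngram-count -read count.1M -order 3 -unk -kndiscount -lm kn.lm

so that the unknown-word token is carried into the LM estimation step as 
well.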

>
> 3. Calculate perplexity:
> ngram -ppl test -order 3 -lm kn.lm
>
> My questions are:
> 1. Did I do it right?
It looks like you did.

> 2. Is there any optimization I can do in building the LM?
a. Try different -order values (example commands below).
b. Try different smoothing methods.
c. Possibly class-based models (interpolated with word-based).
d. If you want to increase the training data size significantly, check the 
methods for conserving memory on the FAQ page.
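For example (file names are placeholders, and the best settings depend on 
your data), (a) and (b) could be tried like this:

# 4-gram with interpolated modified Kneser-Ney
ngram-count -text 1M -order 4 -unk -kndiscount -interpolate -lm kn4.lm
# trigram with Witten-Bell smoothing
ngram-count -text 1M -order 3 -unk -wbdiscount -lm wb3.lm
# evaluate on held-out data
ngram -ppl test -order 4 -lm kn4.lm

and then compare perplexities on your held-out 10% split.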
> 3. How do I calculate perplexity with log base 2 instead of log base 10?
Perplexity is not dependent on the base of the logarithm (the log base 
is matched by the number you exponentiate to get the ppl).
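As a quick illustration (logprob10 is the base-10 log probability that 
ngram -ppl reports, and N the number of scored tokens):

ppl = 10^(-logprob10 / N) = 2^(-logprob2 / N), where logprob2 = logprob10 / log10(2)

so converting the log base never changes the ppl value.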

Andreas


