[SRILM User List] Compacting language models
Andreas Stolcke
stolcke at icsi.berkeley.edu
Wed Feb 23 06:28:56 PST 2011
Luis Uebel wrote:
> I am using SRILM to produce some reverse language models, and they are quite big.
> Stats: training data: 1.1G words
> 88M sentences
>
> but the vocabulary was limited to 39k words (wordlist.txt) by:
> ngram-count -memuse -order 3 -interpolate -kndiscount -unk \
>     -vocab ../lang-data/wordlist.txt -limit-vocab \
>     -text ../lang-data/${training}-${reverse}.xml \
>     -lm ${training}-reverse-lm${trigram}
>
>
> Are there other options to reduce the LM size, since the trigrams alone
> are 1.7G (without losing too much performance)?
Luis,
if the issue is that training takes too much memory, please see the FAQ
on memory issues; one common recipe is sketched below.
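For illustration (the file names here are placeholders, not from your setup,
and you should check the FAQ and the make-big-lm man page for the exact
options), a more memory-friendly workflow is to dump sorted counts first and
then estimate the model from the count file, e.g. with the make-big-lm
wrapper that ships with SRILM:

    # dump sorted 3-gram counts instead of building the LM in one pass
    ngram-count -order 3 -text corpus.txt -sort -write counts.gz

    # estimate the LM from the counts; make-big-lm accepts the usual
    # ngram-count options and keeps intermediate files under the -name prefix
    make-big-lm -read counts.gz -name biglm -order 3 -kndiscount -interpolate \
        -unk -vocab wordlist.txt -limit-vocab -lm biglm.gz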
If you already have a (large) LM and want to reduce its size for test
purposes, use the ngram -prune option (an example follows the references
below). You will want to read the following papers to understand how LM
pruning works:
A. Stolcke, "Entropy-based Pruning of Backoff Language Models,"
Proc. DARPA Broadcast News Transcription and Understanding Workshop,
pp. 270-274, Lansdowne, VA, 1998.

C. Chelba, T. Brants, W. Neveitt, and P. Xu, "Study on Interaction
Between Entropy Pruning and Kneser-Ney Smoothing," Proc. Interspeech,
pp. 2422-2425, Makuhari, Japan, 2010.
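For example (the file names are placeholders, and the threshold is only an
illustrative starting point that you should tune on held-out data):

    # drop N-grams whose removal changes model entropy by less than the threshold,
    # then write the smaller model back out
    ngram -order 3 -lm biglm.gz -prune 1e-8 -write-lm pruned-lm.gz

Larger thresholds prune more aggressively, giving a smaller model at the cost
of higher perplexity; the Chelba et al. paper above explains why entropy
pruning and Kneser-Ney smoothing interact particularly badly, which is
relevant here since you are training with -kndiscount.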
Andreas
>
> Thanks,
>
>
> Luis
>