[SRILM User List] Compacting language models
Andreas Stolcke
stolcke at icsi.berkeley.edu
Wed Feb 23 06:28:56 PST 2011
Luis Uebel wrote:
> I am using SRILM to produce some reverse language models, and they are quite big.
> Stats: training data: 1.1G words
> 88M sentences
>
> but the vocabulary was limited to 39k words (wordlist.txt) by:
> ngram-count -memuse -order 3 -interpolate -kndiscount -unk \
>     -vocab ../lang-data/wordlist.txt -limit-vocab \
>     -text ../lang-data/${training}-${reverse}.xml \
>     -lm ${training}-reverse-lm${trigram}
>
>
> Are there other options to reduce the LM size, since the trigrams alone
> are 1.7G (without losing too much performance)?
Luis,
if the issue is that training takes too much memory, please see the FAQ
on memory issues; one common recipe is sketched below.
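For illustration (the file names here are placeholders, not from your setup,
and you should check the FAQ and the make-big-lm man page for the exact
options), a more memory-friendly workflow is to dump sorted counts first and
then estimate the model from the count file, e.g. with the make-big-lm
wrapper that ships with SRILM:

    # dump sorted 3-gram counts instead of building the LM in one pass
    ngram-count -order 3 -text corpus.txt -sort -write counts.gz

    # estimate the LM from the counts; make-big-lm accepts the usual
    # ngram-count options and keeps intermediate files under the -name prefix
    make-big-lm -read counts.gz -name biglm -order 3 -kndiscount -interpolate \
        -unk -vocab wordlist.txt -limit-vocab -lm biglm.gz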
If you already have a (large) LM and want to reduce its size for test
purposes, use the ngram -prune option (an example follows the references
below). You will want to read the following papers to understand how LM
pruning works:
A. Stolcke, "Entropy-based Pruning of Backoff Language Models,"
Proc. DARPA Broadcast News Transcription and Understanding Workshop,
pp. 270-274, Lansdowne, VA, 1998.

C. Chelba, T. Brants, W. Neveitt, and P. Xu, "Study on Interaction
Between Entropy Pruning and Kneser-Ney Smoothing," Proc. Interspeech,
pp. 2422-2425, Makuhari, Japan, 2010.
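For example (the file names are placeholders, and the threshold is only an
illustrative starting point that you should tune on held-out data):

    # drop N-grams whose removal changes model entropy by less than the threshold,
    # then write the smaller model back out
    ngram -order 3 -lm biglm.gz -prune 1e-8 -write-lm pruned-lm.gz

Larger thresholds prune more aggressively, giving a smaller model at the cost
of higher perplexity; the Chelba et al. paper above explains why entropy
pruning and Kneser-Ney smoothing interact particularly badly, which is
relevant here since you are training with -kndiscount.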
Andreas
>
> Thanks,
>
>
> Luis
>