Class expansion
Andreas Stolcke
stolcke at speech.sri.com
Thu Feb 19 10:08:08 PST 2004
In message <1077203635.13538.38.camel at NOOL2> you wrote:
> Hello,
>
> I'm trying to convert a class bigram to its equivalent word n-gram,
> using the "ngram" tool with the -expand-classes option. The class model
> has 1000 classes, and there are 60000 words. I use the following command
> line:
>
> ngram -lm <classmodel> -classes <classesfile> -expand-classes 2
> -write-lm <outputmodel>
>
> The process runs about 15 minutes using over 700M of RAM, and then gets
> killed by the OS (I'm using Linux), probably when it asked for more
> memory than the OS had available (I have 512M of main memory).
>
> Is it normal that the class expansion takes that much RAM? Is there a
> way around it?
It is expected. You're seeing a combinatorial explosion of ngrams
as the classes get expanded. In general it is not feasible to expand
a large-vocabulary class LM with several hundred classes.
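To put rough numbers on the explosion, here is a back-of-the-envelope
sketch using the figures from the question. It assumes words are spread
evenly over the classes and that each word belongs to one class; the
count of class bigrams passed in at the end is a made-up illustration,
not a figure from the model in question.

```python
# Figures taken from the question.
num_words = 60_000
num_classes = 1_000

# Assumption: words are spread evenly over classes (real classes are skewed).
avg_words_per_class = num_words / num_classes

# One class bigram (c1, c2) expands to |c1| * |c2| word bigrams.
word_bigrams_per_class_bigram = avg_words_per_class ** 2


def expanded_bigram_estimate(num_class_bigrams: int) -> float:
    """Rough number of word bigrams produced by full expansion."""
    return num_class_bigrams * word_bigrams_per_class_bigram


# Hypothetical model size: 100,000 class bigrams.
print(expanded_bigram_estimate(100_000))
```

Under these assumptions every class bigram turns into about 3,600 word
bigrams, so even a modest class model expands to hundreds of millions
of entries.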
ngram -expand-classes was designed for medium-vocabulary class LMs,
especially ones with hand-designed classes. It works fine for domains
like ATIS, SPINE, Communicator, etc.
There is a way around it, but it would require some coding.
You could do the class expansion, and interleave it with ngram pruning.
In other words, right after you expand all the class ngrams that share
a word ngram context, you perform entropy-based pruning to retain only
those that "matter". This should dramatically reduce the size of
the expanded model.
--Andreas