Class expansion

Thu Feb 19 10:08:08 PST 2004

In message <1077203635.13538.38.camel at NOOL2> you wrote:
> Hello,
> 
> I'm trying to convert a class bigram to its equivalent word n-gram,
> using the "ngram" tool with the -expand-classes option. The class model
> has 1000 classes, and there are 60000 words. I use the following command
> line:
> 
> ngram -lm <classmodel> -classes <classesfile> -expand-classes 2
> -write-lm <outputmodel>
> 
> The process runs about 15 minutes using over 700M of RAM, and then gets
> killed by the OS (I'm using Linux), probably when it asked for even more
> memory than the OS had available (I have 512M of main memory).
> 
> Is it normal that the class expansion takes that much RAM? Is there a
> way around it?

It is expected.  You're seeing a combinatorial explosion of ngrams
as the classes get expanded: each class ngram turns into one word ngram
for every combination of its class members.  In general it is not
feasible to expand a large-vocabulary class LM with several hundred
classes.
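
To get a feel for the numbers, here is a rough back-of-the-envelope
estimate in Python.  It is only a sketch: it assumes every word belongs
to exactly one class and that all classes are the same size, neither of
which is stated in the question.  The function name is made up for
illustration.

```python
def expanded_bigram_estimate(num_words, num_classes, num_class_bigrams):
    """Estimate how many word bigrams a class bigram LM expands into.

    Assumption (not from the original post): classes are disjoint and
    equally sized, so each class holds num_words // num_classes words.
    """
    words_per_class = num_words // num_classes
    # Each class bigram (c1, c2) yields one word bigram for every
    # pairing of a member of c1 with a member of c2.
    return num_class_bigrams * words_per_class * words_per_class


# 60000 words in 1000 classes: 60 words per class on average,
# so a single class bigram expands to 60 * 60 word bigrams.
per_class_bigram = expanded_bigram_estimate(60000, 1000, 1)
print(per_class_bigram)
```

Under those assumptions every class bigram in the model becomes 3600
word bigrams, so even a modest number of class bigrams quickly exceeds
what fits in a few hundred megabytes.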

ngram -expand-classes was designed for medium-vocabulary class LMs,
especially ones with hand-designed classes.  It works fine for domains
like ATIS, SPINE, Communicator, etc.

There is a way around it, but it would require some coding.  
You could do the class expansion, and interleave it with ngram pruning.
In other words, right after you expand all the class ngrams that share
a word ngram context, you perform entropy-based pruning to retain only
those that "matter".  This should dramatically reduce the size of
the expanded model.
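
The interleaving idea could be sketched roughly as follows, in Python
rather than inside SRILM itself.  This is not SRILM code, and the
pruning score is a simplified stand-in for the real entropy-based
criterion: it weights the log difference between the expanded bigram
probability and the word unigram probability, and it does not recompute
backoff weights.  All names and data structures here are hypothetical,
and all probabilities are assumed to be strictly positive.

```python
import math


def expand_and_prune(class_bigrams, members, unigram, threshold):
    """Expand a class bigram LM one word context at a time, pruning as we go.

    class_bigrams: dict mapping (c1, c2) -> P(c2 | c1)
    members:       dict mapping class -> {word: P(word | class)}
    unigram:       dict mapping word -> P(word), used both as the
                   context weight and as a crude backoff estimate
    threshold:     minimum score a word bigram needs to be kept
    """
    # Group the class bigrams by their context class.
    by_context = {}
    for (c1, c2), p in class_bigrams.items():
        by_context.setdefault(c1, []).append((c2, p))

    kept = {}
    for c1, successors in by_context.items():
        for w1 in members[c1]:
            # Expand only the bigrams that share the word context w1.
            expanded = {}
            for c2, p_class in successors:
                for w2, p_word in members[c2].items():
                    expanded[w2] = expanded.get(w2, 0.0) + p_class * p_word

            # Prune right away, before moving on to the next context.
            # Simplified score (an assumption, not the full criterion):
            # P(w1) * P(w2|w1) * |log P(w2|w1) - log P(w2)|
            for w2, p in expanded.items():
                score = unigram[w1] * p * abs(math.log(p) - math.log(unigram[w2]))
                if score >= threshold:
                    kept[(w1, w2)] = p
            # "expanded" is dropped here, so only one context's worth
            # of unpruned bigrams is ever held in memory.
    return kept


# Toy example with made-up classes and probabilities.
members = {
    "VERB": {"fly": 0.9, "go": 0.1},
    "CITY": {"boston": 0.5, "denver": 0.5},
}
class_bigrams = {("VERB", "CITY"): 0.8}
unigram = {"fly": 0.5, "go": 0.1, "boston": 0.2, "denver": 0.2}

kept = expand_and_prune(class_bigrams, members, unigram, threshold=0.05)
```

The point of the structure is that the full expansion never exists in
memory at once: each word context is expanded, scored, and pruned
before the next one is touched.  A real implementation would use the
proper entropy-based criterion and renormalize backoff weights.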

--Andreas