Memory problems

Andreas Stolcke stolcke at speech.sri.com
Tue May 22 11:19:21 PDT 2001


Nuno,

you will have to raise the count thresholds on your bigrams, trigrams,
and 4-grams to the point where you can fit things into memory, or at
least where you can tolerate the paging.  (The LM estimation traverses
the count and LM data structures in a fairly localized fashion, so some
amount of paging is certainly tolerable.)

Use make-big-lm and play with the -gt2min, -gt3min, and -gt4min
parameters until the memory requirements become manageable.
Raise the thresholds on the higher-order ngrams first, since those
have a smaller effect
on LM performance.  I have successfully built 5-gram models in 512 MB 
of memory from about 1.3 GB of gzipped counts using 

	-gt2min 1 -gt3min 2 -gt4min 4 -gt5min 4
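
For example, a full invocation might look like the following (the count
and LM file names here are only placeholders for your own):

	make-big-lm -name biglm -read merged-counts.gz \
		-order 4 -gt2min 1 -gt3min 2 -gt4min 4 \
		-lm big4gram.lm.gz

make-big-lm passes the -order, -gtNmin, and -lm options through to
ngram-count, and reads and writes .gz files transparently.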

As an independent measure, you could recompile the LM library and tools
with -DUSE_SARRAY_TRIE -DUSE_SARRAY, which switches to a slower but
more memory-efficient version of the data structures (the default
setting is to optimize for speed).
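
With the standard SRILM build layout, one way to do this (the exact
variable and file names may differ between versions, so check your own
makefiles) is to add the defines to the machine-specific options file
and rebuild from the top-level directory:

	# in common/Makefile.machine.<your-platform>
	ADDITIONAL_CFLAGS   += -DUSE_SARRAY -DUSE_SARRAY_TRIE
	ADDITIONAL_CXXFLAGS += -DUSE_SARRAY -DUSE_SARRAY_TRIE

	make World

Make sure previously built object files are removed first, so that
everything is recompiled with the new flags.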

Once you have managed to build the LM, you will probably also want to
apply entropy-based pruning to it (ngram -prune) to further reduce
memory use and loading time without sacrificing much performance.
A better approach would be to integrate the pruning with the estimation
so that irrelevant counts are excluded up front, but that will have to
wait on my (or someone's) to-do list.
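
A pruning run might look like the following (the threshold value is
only illustrative; you will want to tune it against held-out data):

	ngram -order 4 -lm big4gram.lm.gz -prune 1e-8 \
		-write-lm big4gram-pruned.lm.gz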

Hope this helps.

--Andreas

In message <3B0A5AEC.8CD4D20B at weenie.inesc.pt> you wrote:
> Hi!
> I'm trying to use the SRILM toolkit to create a language model. As I'm
> using large amounts of text and creating 4-gram language models, the
> counts files (created using the make-batch-counts/merge-batch-counts
> scripts) get too big - 1.3 GB gzipped. When I try to create the
> language model using ngram-count or make-big-lm, the programs abort
> because there isn't enough memory (500 MB). How can I solve this
> problem?
> Regards
> 
> Souto



