[SRILM] Some more FLM questions
Andreas Stolcke
stolcke at speech.sri.com
Mon Oct 23 15:46:53 PDT 2006
ilya oparin wrote:
>
> 2) Could you please specify how you work with large
> data?
> When I was training the model on 5M data, it was
> taking 1.2G of memory. Actually, I work with
> inflectional languages (Russian and Czech), so the
> factors are really "rich": features for each word
> include its stem, inflection, detailed morphological
> tag, and lemma. Maybe that's why it takes so much
> space? Otherwise I cannot see how you managed to run
> it on 30G words of English: in my case, if I want to
> enlarge the data, it seems like I'll have to switch to
> a 64-bit architecture. Do SRILM and FLM support 64-bit
> somehow?
> If it's only me that's so "lucky" with memory loads, what
> could you suggest to reduce it?
>
Yes, SRILM supports 64-bit Linux (and other) platforms. For Linux
running on AMD64-compatible machines, use

    make MACHINE_TYPE=i686-m64

To reduce memory consumption, use the strategies described in doc/FAQ.
I'm copying the relevant bits here; many of them apply to
FLMs as well.
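For example, a full 64-bit build might look like this (a sketch; see
the INSTALL file for the authoritative build instructions, and note
that the bin directory name follows the MACHINE_TYPE setting):

    cd $SRILM
    make World MACHINE_TYPE=i686-m64
    # binaries end up under $SRILM/bin/i686-m64
    export PATH=$SRILM/bin/i686-m64:$PATH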
> Topic: Large data / too little memory issues
>
> 1) I'm getting a message saying (among other things)
>
> Assertion `body != 0' failed.
>
> A: You are running out of memory. See subsequent questions depending on
> what you are trying to do. Note: the above message means you are running
> out of "virtual" memory on your computer, which could be because of
> limits in swap space, administrative resource limits, or limitations of
> the machine architecture (a 32-bit machine cannot address more than
> 4GB no matter how many resources your system has).
> Another symptom of not enough memory is that your program runs, but
> very, very slowly, i.e., it is "paging" or "swapping" as it tries to
> use more memory than the machine has RAM installed.
>
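To check whether administrative resource limits are the culprit,
standard shell commands (not SRILM-specific) can help; for example:

    ulimit -a    # show all current resource limits
    ulimit -v    # virtual memory limit, in kbytes
    free -m      # installed RAM and swap, in MB (Linux)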
> 2) I am trying to count N-grams in a text file and running out of memory.
>
> A: Don't use ngram-count directly to count N-grams. Instead, use the
> make-batch-counts and merge-batch-counts scripts described in
> training-scripts(1). That way you can create N-gram counts limited
> only by the maximum file size on your system.
>
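A sketch of the batch-counting workflow (file names are placeholders;
see training-scripts(1) for the exact arguments):

    # list the input text files, then count them in batches of 10,
    # writing per-batch count files to the directory "counts"
    ls data/part-*.txt > file-list
    make-batch-counts file-list 10 /bin/cat counts -order 3
    # merge the per-batch counts into a single count file
    merge-batch-counts counts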
> 3) I am trying to build an N-gram LM and ngram-count runs out of memory.
>
> A: You are running out of memory either because of the size of the
> ngram counts or because of the size of the LM being built. The
> following are strategies for reducing the memory requirements for
> training LMs.
>
> a) Assuming you are using Good-Turing or Kneser-Ney discounting, don't
> use ngram-count in "raw" form. Instead, use the make-big-lm wrapper
> script described in the training-scripts(1) man page (see the
> example after this list).
> b) Switch to using the "_c" or "_s" versions of the SRI binaries. For
> instructions on how to build them, see the INSTALL file.
> Once built, set your executable search path accordingly, and try
> make-big-lm again.
>
> c) Raise the minimum counts for N-grams included in the LM, i.e., the
> values of the options -gt2min, -gt3min, -gt4min, etc. The
> higher-order N-grams typically get higher minimum counts.
>
> d) Get a machine with more memory. If you are hitting the limitations
> of a 32-bit machine architecture, get a 64-bit machine and recompile
> SRILM to take advantage of the expanded address space. (The
> "i686-m64" MACHINE_TYPE setting is for systems based on 64-bit AMD
> processors.)
> Note that 64-bit pointers incur a memory overhead of their own, so
> you will need a machine with significantly more than 4GB of memory,
> not just a little more.
>
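Putting a) through c) together, a sketch of a large-LM training run
(file names and cutoff values are illustrative only):

    # use the compact "_c" binaries built per the INSTALL file;
    # the exact bin directory name depends on your MACHINE_TYPE
    export PATH=$SRILM/bin/i686_c:$PATH
    make-big-lm -name biglm \
        -read merged-counts.gz -order 3 \
        -kndiscount -interpolate \
        -gt2min 2 -gt3min 3 \
        -lm big.lm.gz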
> 4) I am trying to apply a large LM to some data and am running out of
> memory.
>
> A: Again, there are several strategies to reduce memory requirements.
>
> a) Use the "_c" or "_s" versions of the SRI binaries. See 3b) above.
>
> b) Precompute the vocabulary of your test data and use the
> ngram -limit-vocab option to load only the N-gram parameters
> relevant to your data. This approach should allow you to use
> arbitrarily large LMs, provided the data is divided into small
> enough chunks (see the example below).
>
> c) If the LM can be built on a large machine but is then to be used
> on machines with limited memory, use ngram -prune to remove the
> less important parameters of the model. This usually gives huge
> size reductions with relatively modest performance degradation.
> The tradeoff is adjustable by varying the pruning parameter.
>
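For example (a sketch; file names and the pruning threshold are
placeholders):

    # extract the test-set vocabulary
    ngram-count -text test.txt -write-vocab test.vocab -order 1
    # load only the LM parameters relevant to that vocabulary
    ngram -vocab test.vocab -limit-vocab -lm big.lm.gz -ppl test.txt
    # or, on the big machine, prune the LM once and ship the result
    ngram -lm big.lm.gz -prune 1e-8 -write-lm pruned.lm.gz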
Andreas