[SRILM] Some more FLM questions
Andreas Stolcke
stolcke at speech.sri.com
Mon Oct 23 15:46:53 PDT 2006
ilya oparin wrote:
>
> 2) Could you please specify how you work with large
> data?
> When I was training the model on 5M data, it was
> taking 1.2G of memory. Actually, I work with
> inflectional languages (Russian and Czech), so the
> factors are really "rich": features for each word
> include its stem, inflection, detailed morphological
> tag, and lemma. Maybe that's why it takes so much
> space? Otherwise I cannot see how you managed to run
> it on 30G words of English: in my case, if I want to
> enlarge the data, it seems like I'll have to switch to
> a 64-bit architecture. Do SRILM and FLM support 64-bit
> somehow?
> If it's only me that's so "lucky" with memory loads, what
> could you suggest to reduce it?
>
Yes, SRILM supports 64-bit Linux (and other) platforms. For Linux
running on AMD64-compatible machines, use

    make MACHINE_TYPE=i686-m64

To reduce memory consumption, use the strategies described in doc/FAQ.
I'm copying the relevant bits here; many of them apply to
FLMs as well.
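For example, a full 64-bit build might look like this (a sketch; see
the INSTALL file for the authoritative build instructions, and note
that the bin directory name follows the MACHINE_TYPE setting):

    cd $SRILM
    make World MACHINE_TYPE=i686-m64
    # binaries end up under $SRILM/bin/i686-m64
    export PATH=$SRILM/bin/i686-m64:$PATH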
> Topic: Large data / too little memory issues
>
> 1) I'm getting a message saying (among other things)
>
> Assertion `body != 0' failed.
>
> A: You are running out of memory. See subsequent questions depending on
> what you are trying to do. Note: the above message means you are running
> out of "virtual" memory on your computer, which could be because of
> limits in swap space, administrative resource limits, or limitations of
> the machine architecture (a 32-bit machine cannot address more than
> 4GB no matter how many resources your system has).
> Another symptom of not enough memory is that your program runs, but
> very, very slowly, i.e., it is "paging" or "swapping" as it tries to
> use more memory than the machine has RAM installed.
>
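To check whether administrative resource limits are the culprit,
standard shell commands (not SRILM-specific) can help; for example:

    ulimit -a    # show all current resource limits
    ulimit -v    # virtual memory limit, in kbytes
    free -m      # installed RAM and swap, in MB (Linux)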
> 2) I am trying to count N-grams in a text file and running out of memory.
>
> A: Don't use ngram-count directly to count N-grams. Instead, use the
> make-batch-counts and merge-batch-counts scripts described in
> training-scripts(1). That way you can create N-gram counts limited
> only by the maximum file size on your system.
>
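A sketch of the batch-counting workflow (file names are placeholders;
see training-scripts(1) for the exact arguments):

    # list the input text files, then count them in batches of 10,
    # writing per-batch count files to the directory "counts"
    ls data/part-*.txt > file-list
    make-batch-counts file-list 10 /bin/cat counts -order 3
    # merge the per-batch counts into a single count file
    merge-batch-counts counts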
> 3) I am trying to build an N-gram LM and ngram-count runs out of memory.
>
> A: You are running out of memory either because of the size of the
> ngram counts or because of the size of the LM being built. The
> following are strategies for reducing the memory requirements for
> training LMs.
>
> a) Assuming you are using Good-Turing or Kneser-Ney discounting, don't
> use ngram-count in "raw" form. Instead, use the make-big-lm wrapper
> script described in the training-scripts(1) man page (see the
> example after this list).
> b) Switch to using the "_c" or "_s" versions of the SRI binaries. For
> instructions on how to build them, see the INSTALL file.
> Once built, set your executable search path accordingly, and try
> make-big-lm again.
>
> c) Raise the minimum counts for N-grams included in the LM, i.e., the
> values of the options -gt2min, -gt3min, -gt4min, etc. The
> higher-order N-grams typically get higher minimum counts.
>
> d) Get a machine with more memory. If you are hitting the limitations
> of a 32-bit machine architecture, get a 64-bit machine and recompile
> SRILM to take advantage of the expanded address space. (The
> "i686-m64" MACHINE_TYPE setting is for systems based on 64-bit AMD
> processors.)
> Note that 64-bit pointers incur a memory overhead of their own, so
> you will need a machine with significantly more than 4GB of memory,
> not just a little more.
>
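Putting a) through c) together, a sketch of a large-LM training run
(file names and cutoff values are illustrative only):

    # use the compact "_c" binaries built per the INSTALL file;
    # the exact bin directory name depends on your MACHINE_TYPE
    export PATH=$SRILM/bin/i686_c:$PATH
    make-big-lm -name biglm \
        -read merged-counts.gz -order 3 \
        -kndiscount -interpolate \
        -gt2min 2 -gt3min 3 \
        -lm big.lm.gz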
> 4) I am trying to apply a large LM to some data and am running out of
> memory.
>
> A: Again, there are several strategies to reduce memory requirements.
>
> a) Use the "_c" or "_s" versions of the SRI binaries. See 3b) above.
>
> b) Precompute the vocabulary of your test data and use the
> ngram -limit-vocab option to load only the N-gram parameters
> relevant to your data. This approach should allow you to use
> arbitrarily large LMs, provided the data is divided into small
> enough chunks (see the example below).
>
> c) If the LM can be built on a large machine but is then to be used
> on machines with limited memory, use ngram -prune to remove the
> less important parameters of the model. This usually gives huge
> size reductions with relatively modest performance degradation.
> The tradeoff is adjustable by varying the pruning parameter.
>
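For example (a sketch; file names and the pruning threshold are
placeholders):

    # extract the test-set vocabulary
    ngram-count -text test.txt -write-vocab test.vocab -order 1
    # load only the LM parameters relevant to that vocabulary
    ngram -vocab test.vocab -limit-vocab -lm big.lm.gz -ppl test.txt
    # or, on the big machine, prune the LM once and ship the result
    ngram -lm big.lm.gz -prune 1e-8 -write-lm pruned.lm.gz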
Andreas