[SRILM User List] How to interpolate big LMs?
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Aug 16 11:06:55 PDT 2012
On 8/16/2012 4:07 AM, Meng Chen wrote:
> Hi, suppose I have trained three big LMs: LM1, LM2, and LM3, each of
> which has billions of ngrams. I would like to know how to interpolate
> such big LMs together. I found that the ngram command in SRILM loads
> all the LMs into memory first, so it hits the server's memory limit.
> In that situation, how can I interpolate the big LMs?
>
> Another question, about training an LM on a large corpus. There are
> two methods:
> 1) Pool all the data and train one big LM0.
> 2) Split the data into several parts, train small LMs (e.g. LM1 and
> LM2), and interpolate them with equal weights (e.g. 0.5 x LM1 +
> 0.5 x LM2) to get the final LM3.
> The cut-offs and smoothing algorithm are the same for both methods.
> Is LM3 then the same as LM0?
>
>
I'm assuming you are merging ngram LMs into one big LM (using -mix-lm
etc. WITHOUT the -bayes option).
In that case the LMs are merged destructively into the first LM, one by
one. This means that at any given time only the partially merged LM and
the next LM to be merged are kept in memory. So when you're running

    ngram -lm LM1 -mix-lm LM2 -mix-lm2 LM3

it is NOT the case that LM1, LM2, and LM3 are all in memory at the same
time. Instead, the result of merging LM1 and LM2, plus LM3, needs to
fit into memory. Of course, depending on how much the ngrams overlap,
that might take almost the same total memory.
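If you want to write out the merged LM with explicit interpolation
weights, a minimal sketch looks like this (filenames are hypothetical;
-lambda is the weight of the main LM1, -mix-lambda2 the weight of LM3,
and LM2 gets the remaining weight):

    ngram -order 3 \
          -lm LM1.gz -lambda 0.5 \
          -mix-lm LM2.gz \
          -mix-lm2 LM3.gz -mix-lambda2 0.2 \
          -write-lm LM.merged.gz

With these weights LM2 implicitly receives 1 - 0.5 - 0.2 = 0.3.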
Try building your binaries with OPTION=_c (compact memory). Also, try
using the latest beta version off the web site. It contains an
optimized memory allocator that leads to significant memory savings.
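For example, from the top of the SRILM source tree something like

    make World OPTION=_c

should build the compact variants (the exact invocation may also need
MACHINE_TYPE set for your platform).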
Finally, if all else fails, prune your large component LMs prior to merging.
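For instance, a pruning pass like

    ngram -lm LM1.gz -prune 1e-8 -write-lm LM1.pruned.gz

drops ngrams whose removal changes the model's perplexity by less than
the threshold; 1e-8 is only illustrative, and larger values prune more
aggressively.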
Andreas