[SRILM User List] How to interpolate big LMs?

Thu Aug 16 11:06:55 PDT 2012

On 8/16/2012 4:07 AM, Meng Chen wrote:
> Hi, suppose I have trained three big LMs (LM1, LM2 and LM3), each of 
> which has billions of ngrams. I would like to know how to 
> interpolate such big LMs together. I found that the ngram command in 
> SRILM loads all the LMs into memory first, so it hits the 
> server's memory limit. In this situation, how can I interpolate the 
> big LMs?
>
> Another question, about training an LM on a large corpus. There are 
> two methods:
> 1) I can pool all the data to train one big LM0.
> 2) I can split the data into several parts and train small LMs (e.g. 
> LM1 and LM2), then interpolate them with equal weights (e.g. 0.5 x LM1 
> + 0.5 x LM2) to get the final LM3.
> The cut-offs and smoothing algorithm are the same for both 
> methods. So is LM3 the same as LM0?
>
>
I'm assuming you are merging ngram LMs into one big LM (-mix-lm etc. 
WITHOUT the -bayes option).

In that case the LMs are merged destructively into the first LM, one by 
one.  This means at any given time only the partially merged LM and the 
next LM to be merged are kept in memory.  So when you're running

     ngram -lm LM1 -mix-lm LM2 -mix-lm2 LM3

it is NOT the case that LM1, LM2 and LM3 are in memory at the same 
time.  Instead, the result of merging LM1 and LM2, plus LM3, needs to fit 
into memory.  Of course, if the LMs share few ngrams, the merged LM is 
nearly as large as its inputs combined, so the total memory needed 
might be almost the same.

Try building your binaries with OPTION=_c (compact memory).  Also, try 
using the latest beta version off the web site.  It contains an 
optimized memory allocator that leads to significant memory savings.  
Finally, if all else fails, prune your large component LMs prior to merging.
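
As a rough sketch of the prune-then-merge fallback, something like the 
following could be used.  The file names, the pruning threshold, the 
ngram order and the interpolation weights are all placeholder 
assumptions, not recommended values.  The script only prints the 
commands unless DRY_RUN=0 is set, so they can be checked before 
starting a long run.

```shell
#!/bin/sh
# Sketch only: prune each component LM, then merge the pruned LMs.
# All values below are hypothetical placeholders; adjust for your setup.
ORDER=3        # assumed ngram order of the component LMs
PRUNE=1e-8     # assumed entropy-pruning threshold
LAMBDA1=0.4    # assumed weight of the main LM (-lm)
LAMBDA3=0.3    # assumed weight of the third LM (-mix-lm2)
# The -mix-lm model gets the remaining weight: 1 - LAMBDA1 - LAMBDA3.

run() {
    # Print the command; execute it only when DRY_RUN is set to 0.
    printf '%s\n' "$*"
    [ "${DRY_RUN:-1}" = "1" ] || "$@"
}

# Step 1: prune each large LM on its own, so only one is loaded at a time.
for lm in LM1 LM2 LM3; do
    run ngram -order "$ORDER" -lm "$lm" -prune "$PRUNE" \
        -write-lm "$lm.pruned"
done

# Step 2: merge the pruned LMs into a single LM (no -bayes option).
run ngram -order "$ORDER" -lm LM1.pruned \
    -mix-lm LM2.pruned -mix-lm2 LM3.pruned \
    -lambda "$LAMBDA1" -mix-lambda2 "$LAMBDA3" \
    -write-lm merged.lm
```

A smaller threshold prunes less aggressively, so it is probably worth 
starting small and increasing it only until the merge fits in memory.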

Andreas