[SRILM User List] SRILM Workflow: Improve train LM runtime
Andreas Stolcke
stolcke at ICSI.Berkeley.EDU
Mon Mar 30 22:39:23 PDT 2020
On 3/25/2020 12:59 PM, Müller, H.M. (Hanno) wrote:
> Hi Andreas,
>
> I'm training an LM on a huge corpus right now and it is taking very long.
> I was wondering how I could improve my workflow in order to accelerate
> things.
>
> The corpus is stored in around 400 plain text files as a result of the
> preprocessing. This also allows for fast creation of a count file (using
> ngram-count) for every corpus chunk, since I can assign each file to one
> of the 400 cores of the server cluster I'm working with. After that, I
> use ngram-merge (also in parallel) in a binary-tree fashion to merge all
> the count files into one count file. This count file is then read into
> ngram-count again to derive the LM. However, this last step takes very
> long and I cannot see an option to parallelize it. Am I overlooking
> something?
>
> Alternatively, I could derive an LM from each of the 400 text files
> directly and combine them afterwards using ngram with the -mix-lm option
> and a -lambda of 0.5, I guess. Eventually, I would need to combine all
> 400 LMs by merging 9 at a time, where each group of 9 could be assigned
> to a single core. I would iterate this step until only one LM is left. I
> assume that this might work, but would you recommend it? Or is there
> maybe a more elegant and faster approach I'm overlooking?
>
> Cheers,
>
> Hanno
>
Hanno,
You COULD build separate LMs from the batches of input files, and then
merge them as you suggest. However, I doubt that would give you good
results, because proper smoothing relies on having the aggregate
count-of-count statistics from the entire training corpus.
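(For reference, a pairwise interpolation of two component LMs would look
roughly like this; the order, file names, and weight below are just
placeholders:

    ngram -order 5 -lm part1.lm.gz -mix-lm part2.lm.gz -lambda 0.5 \
        -write-lm merged12.lm.gz

Here -lambda is the weight given to the -lm model, and -write-lm saves the
statically interpolated result.)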
It is in principle possible to parallelize the estimation of ngram
probabilities, but that is not implemented in ngram-count. However, I
suspect that a large portion of your elapsed time is spent reading the
counts. (You can test this by invoking ngram-count with just the -read
option and seeing how long it takes.)
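For example, something along these lines would measure the pure
count-reading time (the order and file name are placeholders for your
setup):

    time ngram-count -order 5 -read merged.counts.gz

If that alone already accounts for most of the runtime, the options below
will help.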
To speed up reading the counts, there are a few strategies. One is to
use binary format: write the counts with ngram-count -write-binary. If
you have minimum counts > 1 for higher-order ngrams and/or are using a
limited vocabulary (relative to the full vocabulary of the training set),
you can precompute the discounting parameters (e.g., for KN smoothing)
from the counts and instruct ngram-count to read only the ngrams that
(a) fall within the LM vocabulary (-limit-vocab) and (b) meet the
mincount thresholds (-read-with-mincounts). Combined with binary format,
this allows skipping over the unused ngrams very efficiently (and also
conserves memory).
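As a rough sketch (again with placeholder order and file names), the
conversion to binary counts is a one-time step:

    ngram-count -order 5 -read merged.counts.gz -write-binary merged.counts.bin

The -limit-vocab and -read-with-mincounts options (together with -vocab
and the -gtNmin thresholds) are then given on the final estimation
command, as illustrated with make-big-lm below.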
The discounting (smoothing) parameter computation is illustrated in the
make-big-lm wrapper script. BTW, there are also wrapper scripts for
parallel counting (make-batch-counts) and count merging
(merge-batch-counts). These are documented in the training-scripts(1)
man page.
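For instance, a complete pipeline using these scripts might look roughly
like this (batch size, order, mincounts, vocabulary, and file names are
placeholders, and file.list is assumed to list your 400 text files):

    # count the input files in batches of 10 (each batch's counts can be
    # built independently)
    make-batch-counts file.list 10 cat counts-dir -order 5

    # merge the per-batch count files into one
    merge-batch-counts counts-dir

    # estimate the LM; make-big-lm precomputes the KN discounting
    # parameters from the full counts before the mincounts are applied
    make-big-lm -read counts-dir/merged.ngrams.gz -name biglm -order 5 \
        -vocab vocab.txt -limit-vocab \
        -read-with-mincounts -gt3min 2 -gt4min 2 -gt5min 2 \
        -kndiscount -interpolate -lm big.lm.gz

The name of the merged count file produced by merge-batch-counts will
differ, so adjust the -read argument accordingly.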
Andreas