[SRILM User List] SRILM Workflow: Improve train LM runtime
Andreas Stolcke
stolcke at ICSI.Berkeley.EDU
Mon Mar 30 22:39:23 PDT 2020
On 3/25/2020 12:59 PM, Müller, H.M. (Hanno) wrote:
> Hi Andreas,
>
> I'm training an LM on a huge corpus right now and it is taking very long.
> I was wondering how I could improve my workflow in order to accelerate
> things.
>
> The corpus is stored in around 400 plain text files as a result of the
> preprocessing. This also allows for fast creation of a count file (using
> ngram-count) for every corpus chunk, since I can assign each file to one
> of the 400 cores of the server cluster I'm working with. After that, I
> use ngram-merge (also in parallel) in a binary-tree fashion to merge all
> the count files into one count file. This count file is then read into
> ngram-count again to derive the LM. However, this last step takes very
> long and I cannot see an option to parallelize it. Am I overlooking
> something?
>
> Alternatively, I could derive an LM from each of the 400 text files
> directly and combine them afterwards using ngram with the -mix-lm option
> and a -lambda of 0.5, I guess. Eventually, I would need to combine all
> 400 LMs by merging 9 at a time, where each group of 9 could be assigned
> to a single core. I would iterate this step until only one LM is left. I
> assume that this might work, but would you recommend it? Or is there
> maybe a more elegant and faster approach I'm overlooking?
>
> Cheers,
>
> Hanno
>
Hanno,
You COULD build separate LMs from the batches of input files, and then
merge them as you suggest. However, I doubt that would give you good
results, because proper smoothing relies on having the aggregate
count-of-count statistics from the entire training corpus.
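(For reference, a pairwise interpolation of two component LMs would look
roughly like this; the order, file names, and weight below are just
placeholders:

    ngram -order 5 -lm part1.lm.gz -mix-lm part2.lm.gz -lambda 0.5 \
        -write-lm merged12.lm.gz

Here -lambda is the weight given to the -lm model, and -write-lm saves the
statically interpolated result.)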
It is in principle possible to parallelize the estimation of ngram
probabilities, but that is not implemented in ngram-count. However, I
suspect that a large portion of your elapsed time is spent reading the
counts. (You can test this by invoking ngram-count with just the -read
option and seeing how long it takes.)
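For example, something along these lines would measure the pure
count-reading time (the order and file name are placeholders for your
setup):

    time ngram-count -order 5 -read merged.counts.gz

If that alone already accounts for most of the runtime, the options below
will help.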
To speed up reading the counts, there are a few strategies. One is to
use binary format: write the counts with ngram-count -write-binary. If
you have minimum counts > 1 for higher-order ngrams and/or are using a
limited vocabulary (relative to the full vocabulary of the training set),
you can precompute the discounting parameters (e.g., for KN smoothing)
from the counts and instruct ngram-count to read only the ngrams that
(a) fall within the LM vocabulary (-limit-vocab) and (b) meet the
mincount thresholds (-read-with-mincounts). Combined with binary format,
this allows skipping over the unused ngrams very efficiently (and also
conserves memory).
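As a rough sketch (again with placeholder order and file names), the
conversion to binary counts is a one-time step:

    ngram-count -order 5 -read merged.counts.gz -write-binary merged.counts.bin

The -limit-vocab and -read-with-mincounts options (together with -vocab
and the -gtNmin thresholds) are then given on the final estimation
command, as illustrated with make-big-lm below.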
The discounting (smoothing) parameter computation is illustrated in the
make-big-lm wrapper script. BTW, there are also wrapper scripts for
parallel counting (make-batch-counts) and count merging
(merge-batch-counts). These are documented in the training-scripts(1)
man page.
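For instance, a complete pipeline using these scripts might look roughly
like this (batch size, order, mincounts, vocabulary, and file names are
placeholders, and file.list is assumed to list your 400 text files):

    # count the input files in batches of 10 (each batch's counts can be
    # built independently)
    make-batch-counts file.list 10 cat counts-dir -order 5

    # merge the per-batch count files into one
    merge-batch-counts counts-dir

    # estimate the LM; make-big-lm precomputes the KN discounting
    # parameters from the full counts before the mincounts are applied
    make-big-lm -read counts-dir/merged.ngrams.gz -name biglm -order 5 \
        -vocab vocab.txt -limit-vocab \
        -read-with-mincounts -gt3min 2 -gt4min 2 -gt5min 2 \
        -kndiscount -interpolate -lm big.lm.gz

The name of the merged count file produced by merge-batch-counts will
differ, so adjust the -read argument accordingly.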
Andreas