[SRILM User List] SRILM ngram-count speed

Thu Aug 10 02:47:22 PDT 2017

Mac,

It seems that you need to update the corpus incrementally and often. If 
this is the case, you don't have to recompute the n-gram counts for the 
whole corpus every time. To speed up training, you could first save the 
n-gram counts from the current corpus with

ngram-count -debug 1 -order 3 -sort -text corpus.txt -write model.3grams.gz

(-sort writes the counts in lexicographic order, which ngram-merge 
requires of its input files.)

Then just collect n-gram counts for the additional text, denoted add.txt 
here, that you are going to append to corpus.txt:

ngram-count -debug 1 -order 3 -sort -text add.txt -write add.3grams.gz

Then you could merge the two sets of n-gram counts with

ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz
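
The merge works because n-gram counts are additive over lines: counting 
the old text and the new text separately and summing the results gives 
the same totals as counting the concatenated text, as long as the new 
text is appended as whole lines. Here is a toy sketch of that idea using 
only awk and sort, not SRILM. The bigrams helper and the file names are 
made up for illustration, and it ignores sentence-boundary tokens.

```shell
#!/bin/sh
# Toy illustration only: count word bigrams within each line (no <s> or </s>).
bigrams() {
    awk '{ for (i = 1; i < NF; i++) c[$i " " $(i+1)]++ }
         END { for (k in c) print k "\t" c[k] }' "$@" | LC_ALL=C sort
}

tmp=$(mktemp -d)
printf 'a b c\na b\n' > "$tmp/old.txt"
printf 'b c\na b d\n' > "$tmp/add.txt"

# Count each file separately, then sum the counts of identical bigrams.
bigrams "$tmp/old.txt" > "$tmp/old.counts"
bigrams "$tmp/add.txt" > "$tmp/add.counts"
merged=$(cat "$tmp/old.counts" "$tmp/add.counts" |
    awk -F '\t' '{ c[$1] += $2 } END { for (k in c) print k "\t" c[k] }' |
    LC_ALL=C sort)

# Count the concatenated text in a single pass.
direct=$(bigrams "$tmp/old.txt" "$tmp/add.txt")

[ "$merged" = "$direct" ] && echo "counts match"
rm -rf "$tmp"
```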

Now you can build the LM just by loading the updated counts, 
new.3grams.gz, instead of re-reading the updated text:

ngram-count -debug 1 -order 3 -read new.3grams.gz -lm model.bin -write-binary-lm
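
If you do this often, the three steps could be wrapped in a small shell 
function. This is only a sketch under some assumptions: update_lm and all 
the file names are placeholders, the SRILM tools are on your PATH, the 
base counts were written with -sort (ngram-merge expects sorted input), 
and the output file is named with -lm, with -write-binary-lm used as a 
plain switch that selects the binary format.

```shell
#!/bin/sh
# Sketch of one incremental update; update_lm and the file names are hypothetical.
update_lm() {
    base_counts=$1   # sorted counts for the existing corpus
    add_text=$2      # only the newly added lines
    lm_out=$3        # LM file to write

    # Count just the new text; -sort because ngram-merge needs sorted counts.
    ngram-count -order 3 -sort -text "$add_text" -write add.3grams.gz || return 1
    # Sum the old and new counts.
    ngram-merge -write new.3grams.gz "$base_counts" add.3grams.gz || return 1
    # Estimate the LM from the merged counts and write it in binary format.
    ngram-count -order 3 -read new.3grams.gz -lm "$lm_out" -write-binary-lm || return 1
    # Keep the merged counts as the base for the next update
    # (assumes the output of ngram-merge is itself sorted).
    mv new.3grams.gz "$base_counts"
}
```

You would call it as, for example, update_lm model.3grams.gz add.txt model.bin, 
and then append add.txt to corpus.txt so that the text and the counts stay in sync.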

Thanks,

Wen

On 8/10/17 2:29 AM, Mac Neth wrote:
> Hello,
>
> I am building an LM out of a corpus text file of around 8 MB using
> the SRILM "ngram-count" command, and it takes around 1 minute 30 seconds
> to build the language model file.
>
> Each time I add a line or two to the corpus, I have to rebuild the LM file.
>
> I am using the command as follows:
>
> ngram-count -text corpus.txt -order 3 -lm model.lm
>
> I have been able to improve the performance using the binary option with:
>
> ngram-count -text corpus.txt -order 3 -lm model.lm -write-binary-lm
>
> and the LM file is now produced in around 1 minute.
>
> Is there any further optimization to speed up the LM building?
>
> Thanks in advance,
>
> Mac