[SRILM User List] SRILM ngram-count speed
Wen Wang
wen.wang at sri.com
Thu Aug 10 02:47:22 PDT 2017
Mac,
It sounds like you need to update the corpus incrementally and frequently. If
that is the case, you don't have to recompute n-gram counts from the full
corpus every time. To save time and speed up training, first save the n-gram
counts from the current corpus:
ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz
Then collect n-gram counts for just the additional text (denoted add.txt
here) that you are going to append to corpus.txt:
ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz
Then merge the two sets of counts:
ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz
Now you can build the LM by loading the updated counts, new.3grams.gz,
instead of re-reading the updated text (note that -write-binary-lm is a
flag, so the output file still goes to -lm):
ngram-count -debug 1 -order 3 -read new.3grams.gz -lm model.bin -write-binary-lm
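Putting the steps above together, a minimal update script might look like the
following (a sketch only: it assumes the SRILM binaries ngram-count and
ngram-merge are on your PATH, and corpus.txt/add.txt stand in for your actual
files; the guard lets the script degrade gracefully when SRILM is absent):

```shell
#!/bin/sh
# Incremental LM rebuild: count only the new text, merge counts, re-estimate.
# Assumes SRILM's ngram-count/ngram-merge are on PATH; file names are examples.
set -eu

if command -v ngram-count >/dev/null 2>&1 && [ -f corpus.txt ] && [ -f add.txt ]; then
    # One-time: save counts for the existing corpus.
    ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz

    # Per update: count only the newly appended lines.
    ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz

    # Merge the old and new counts.
    ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz

    # Re-estimate the LM from the merged counts; no full-corpus pass needed.
    ngram-count -debug 1 -order 3 -read new.3grams.gz -lm model.bin -write-binary-lm

    # Roll the merged counts forward for the next update.
    mv new.3grams.gz model.3grams.gz
    STATUS=done
else
    echo "SRILM tools or input files not found; commands shown for reference only"
    STATUS=skipped
fi
```

The point of this layout is that each update only tokenizes and counts the
newly appended lines; the expensive full-corpus counting happens once.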
Thanks,
Wen
On 8/10/17 2:29 AM, Mac Neth wrote:
> Hello,
>
> I am building a LM out of a corpus text file of around 8 MB using
> SRILM "ngram-count" command, and it takes around 1 minute 30 seconds
> to build the language model file.
>
> Each time I add a line or two to the corpus, I have to rebuild the LM file.
>
> I am using the command as follows:
>
> ngram-count -text corpus.txt -order 3 -lm model.lm
>
> I have been able to speed this up with the binary option:
>
> ngram-count -text corpus.txt -order 3 -lm model.lm -write-binary-lm
>
> and the LM file is now produced in around 1 minute.
>
> Is there any further optimization to speed up the LM building?
>
> Thanks in advance,
>
> Mac
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user