[SRILM User List] SRILM ngram-count speed

Mac Neth macnet2008 at gmail.com
Thu Aug 10 11:45:35 PDT 2017


Hi Wen,

Thanks for that. I have tried your steps, but it seems the last step still
takes roughly the same time as before: around 55 seconds:

1) a few seconds
ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz

2) a few seconds
ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz

3) a few seconds
ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz

4) around 55 seconds
ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm -lm model.bin
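
Between runs, my understanding (just a guess from your note about appending
add.txt to corpus.txt) is that the counts should be rolled forward so that
only the new lines need counting next time:

cat add.txt >> corpus.txt              # keep the text corpus in sync
mv new.3grams.gz model.3grams.gz       # merged counts become the next baseline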

I have added the "-lm" option to your command. Should I drop it? Your
command was:

ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin
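
As a side note, one way to sanity-check whichever variant is to load the
resulting file back with ngram on a small held-out text (heldout.txt is just
a placeholder name here):

ngram -debug 1 -order 3 -lm model.bin -ppl heldout.txt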

Thanks,

Mac



2017-08-10 9:47 GMT+00:00 Wen Wang <wen.wang at sri.com>:
> Mac,
>
> It seems that you need to update the corpus incrementally and frequently. If
> that is the case, you don't have to recompute n-gram counts for the whole
> corpus every time. To save time and speed up training, you could first save
> the n-gram counts from the current corpus:
>
> ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz
>
> Then just collect n-gram counts for the additional text, denoted add.txt
> here, that you are going to append to corpus.txt:
>
> ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz
>
> Then you could merge the n-gram counts:
>
> ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz
>
> Now you could build the LM just by loading the updated counts,
> new.3grams.gz, instead of from the updated text:
>
> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin
>
> Thanks,
>
> Wen
>
>
> On 8/10/17 2:29 AM, Mac Neth wrote:
>>
>> Hello,
>>
>> I am building an LM from a corpus text file of around 8 MB using the
>> SRILM "ngram-count" command, and it takes around 1 minute 30 seconds
>> to build the language model file.
>>
>> Each time I add a line or two to the corpus, I have to rebuild the LM
>> file.
>>
>> I am using the command as follows:
>>
>> ngram-count -text corpus.txt -order 3 -lm model.lm
>>
>> I have been able to improve performance by using the binary option with:
>>
>> ngram-count -text corpus.txt -order 3 -lm model.lm -write-binary-lm
>>
>> and the LM file is now produced in around 1 minute.
>>
>> Is there any further optimization to speed up the LM building?
>>
>> Thanks in advance,
>>
>> Mac
>>
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user
>
>
>


