[SRILM User List] SRILM ngram-count speed

Thu Aug 10 23:38:42 PDT 2017

Mac,

please check out Andreas' suggestions on other ways to speed up your LM 
training. My suggestion mostly applies if (1) you need to do this kind 
of incremental update of the corpus frequently or (2) the original 
corpus.txt file is already quite large.

Sorry, that was a typo in my command. Keep the -lm option: you should 
have -write-binary-lm -lm model.bin.
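
For reference, the whole update could be wrapped in a small shell function. This is only a sketch: the names update_lm, run and DRY_RUN are made up for illustration, and the final mv (keeping the merged counts as the baseline for the next update) is an assumption about how you would want to iterate. It also adds -sort when counting add.txt, since as far as I know ngram-merge expects lexicographically sorted count files; model.3grams.gz would then need to be written with -sort as well.

```shell
#!/bin/sh
# Sketch of the incremental LM update discussed in this thread.
# File names follow the thread; run, update_lm and DRY_RUN are hypothetical.

# Run a command, or only print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# update_lm ADD_TEXT
# Folds the counts of ADD_TEXT into model.3grams.gz and rebuilds model.bin.
# Assumes model.3grams.gz already exists and was written with -sort.
update_lm() {
    add_text=$1
    # Count n-grams in the new text only (-sort is an addition, see above).
    run ngram-count -order 3 -sort -text "$add_text" -write add.3grams.gz &&
    # Merge the new counts with the saved counts.
    run ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz &&
    # Estimate the LM from the merged counts instead of from the text.
    run ngram-count -order 3 -read new.3grams.gz -write-binary-lm -lm model.bin &&
    # Assumption: keep the merged counts as the baseline for the next update.
    run mv new.3grams.gz model.3grams.gz
}
```

Calling update_lm add.txt runs the steps; with DRY_RUN=1 set it only prints the commands. Note that this only skips re-counting corpus.txt; the estimation step still runs over all the merged counts.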

Thanks,

Wen

On 8/10/17 11:45 AM, Mac Neth wrote:
> Hi Wen,
>
> Thanks for that. I have tried your steps, but it seems the last step
> takes roughly the same time as before, around 55 seconds:
>
> 1) few seconds
> ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz
>
> 2) few seconds
> ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz
>
> 3) few seconds
> ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz
>
> 4) around 55 seconds
> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm -lm model.bin
>
> I have added the option "-lm" to your command. Should I drop it? Your
> command was:
>
> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin
>
> Thanks,
>
> Mac
>
>
>
> 2017-08-10 9:47 GMT+00:00 Wen Wang <wen.wang at sri.com>:
>> Mac,
>>
>> It seems that you need to incrementally update the corpus frequently. If
>> this is the case, you don't have to compute n-gram counts every time. To
>> save time and speed up training, you could first save n-gram counts from the
>> current corpus, by
>>
>> ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz
>>
>> Then just collect n-gram counts for the additional text, denoted add.txt
>> here,  that you are going to append to corpus.txt
>>
>> ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz
>>
>> Then you could merge the n-gram counts, by
>>
>> ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz
>>
>> Now, you could build the LM just by loading the updated counts,
>> new.3grams.gz, instead of from the updated text:
>>
>> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin
>>
>> Thanks,
>>
>> Wen
>>
>>
>> On 8/10/17 2:29 AM, Mac Neth wrote:
>>> Hello,
>>>
>>> I am building an LM out of a corpus text file of around 8 MB using
>>> the SRILM "ngram-count" command, and it takes around 1 minute 30 seconds
>>> to build the language model file.
>>>
>>> Each time I add a line or two to the corpus, I have to rebuild the LM
>>> file.
>>>
>>> I am using the command as follows :
>>>
>>> ngram-count -text corpus.txt -order 3 -lm model.lm
>>>
>>> I have been able to optimize the performance using the binary option
>>> with:
>>>
>>> ngram-count -text corpus.txt -order 3 -lm model.lm -write-binary-lm
>>>
>>> and the LM file is now produced in around 1 minute.
>>>
>>> Is there any further optimization to speed up the LM building?
>>>
>>> Thanks in advance,
>>>
>>> Mac
>>>
>>> _______________________________________________
>>> SRILM-User site list
>>> SRILM-User at speech.sri.com
>>> http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user
>>
>>