[SRILM User List] SRILM ngram-count speed

Andreas Stolcke stolcke at icsi.berkeley.edu
Thu Aug 10 14:05:12 PDT 2017


The time spent in ngram-count is made up of two components:

- time to count the ngrams
- time to estimate the LM

Right now your training corpus is small, so the first component is small 
compared to the second, and saving effort on counting will not gain you 
much overall.
However, if your base model were trained on a substantial corpus, the 
savings would be a larger fraction of the overall time.

Unfortunately, you cannot do the LM estimation in an incremental way, 
because various aspects of it (e.g., computing the smoothing parameters) 
depend on having the entire count distribution. However, you could use 
an approach where you don't train an entirely new model on the combined 
data, and instead just interpolate the base model with a small model 
trained only on the new data.  (I just responded to a different post on 
the list describing this approach.)  The resulting model would be 
suboptimal, but the speedup might be worth it, depending on your application.
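
A minimal sketch of that shortcut (file names are illustrative: base.lm 
stands for the trigram model already trained on the full corpus, and the 
0.5 interpolation weight is only a default to be tuned on held-out data):

# train a small LM on just the new data (fast, since add.txt is tiny)
ngram-count -order 3 -text add.txt -lm add.lm

# statically interpolate it with the base model; -lambda is the weight
# given to the first (-lm) model
ngram -order 3 -lm base.lm -mix-lm add.lm -lambda 0.5 -write-lm mixed.lm

Depending on how sparse add.txt is, you may need to adjust the smoothing 
options when estimating the small model.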

Andreas


On 8/10/2017 11:45 AM, Mac Neth wrote:
> Hi Wen,
>
> Thanks for that. I have tried your steps, but it seems the last step
> takes roughly the same time as before: around 55 seconds:
>
> 1) a few seconds
> ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz
>
> 2) a few seconds
> ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz
>
> 3) a few seconds
> ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz
>
> 4) around 55 seconds
> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm -lm model.bin
>
> I have added the option "-lm" to your command. Should I drop it? Your
> command was:
>
> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin
>
> Thanks,
>
> Mac
>
>
>
> 2017-08-10 9:47 GMT+00:00 Wen Wang <wen.wang at sri.com>:
>> Mac,
>>
>> It seems that you need to update the corpus incrementally and frequently.
>> If that is the case, you don't have to recompute all the n-gram counts
>> every time. To save time and speed up training, you could first save the
>> n-gram counts from the current corpus:
>>
>> ngram-count -debug 1 -order 3 -text corpus.txt -write model.3grams.gz
>>
>> Then just collect n-gram counts for the additional text (denoted add.txt
>> here) that you are going to append to corpus.txt:
>>
>> ngram-count -debug 1 -order 3 -text add.txt -write add.3grams.gz
>>
>> Then you could merge the two sets of n-gram counts:
>>
>> ngram-merge -write new.3grams.gz model.3grams.gz add.3grams.gz
>>
>> Now, you could build the LM just by loading the updated counts,
>> new.3grams.gz, rather than from the updated text:
>>
>> ngram-count -debug 1 -order 3 -read new.3grams.gz -write-binary-lm model.bin
>>
>> Thanks,
>>
>> Wen
>>
>>
>> On 8/10/17 2:29 AM, Mac Neth wrote:
>>> Hello,
>>>
>>> I am building an LM from a corpus text file of around 8 MB using the
>>> SRILM "ngram-count" command, and it takes around 1 minute 30 seconds
>>> to build the language model file.
>>>
>>> Each time I add a line or two to the corpus, I have to rebuild the LM
>>> file.
>>>
>>> I am using the command as follows:
>>>
>>> ngram-count -text corpus.txt -order 3 -lm model.lm
>>>
>>> I have been able to improve the performance using the binary output
>>> option:
>>>
>>> ngram-count -text corpus.txt -order 3 -lm model.lm -write-binary-lm
>>>
>>> and the LM file is now produced in around 1 minute.
>>>
>>> Is there any further optimization to speed up the LM building?
>>>
>>> Thanks in advance,
>>>
>>> Mac
