[SRILM User List] Question Regarding SRILM N-gram tools

Ryan Roth rmr4848 at gmail.com
Wed Aug 25 11:16:14 PDT 2010


Thank you, Andreas. This was very helpful.  I will make use of the SRILM
mailing list from now on.

Ryan Roth
CCLS
Columbia University

On Tue, Aug 24, 2010 at 5:49 PM, Andreas Stolcke <stolcke at speech.sri.com> wrote:

> Ryan,
>
> I suggest you use the -limit-vocab option with ngram, and write out your
> LM in binary format.
> Reading a binary LM with -limit-vocab is very efficient, because only the
> portions of the LM parameters that pertain to your test-set vocabulary are
> processed.
> You can generate the vocabulary used by your test data using
>
> ngram-count -text DATA -write-vocab VOCAB
>
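> For example, a rough sketch (BIGLM, BIGLM.bin, and DATA are placeholder
> file names; -order 3 assumes a trigram model, and -write-bin-lm requires
> a reasonably recent SRILM version):
>
> ngram -order 3 -lm BIGLM -write-bin-lm BIGLM.bin
> ngram -order 3 -lm BIGLM.bin -vocab VOCAB -limit-vocab -ppl DATA -debug 2
>
> With -debug 2, ngram prints the probability assigned to each word in DATA,
> which gives you the per-N-gram probabilities directly.
>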
> There is a tradeoff between processing small batches of data (hence small
> vocabularies, hence fast loading of the LM) and large batches (larger
> vocabularies, but the LM is loaded fewer times), so you might want to tune
> the batch size empirically for best overall throughput.
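>
> For instance, a batching loop might look like this (the batch size, LM
> order, and file names are all placeholders to adjust):
>
> split -l 10000 DATA batch.
> for f in batch.*; do
>     ngram-count -text $f -write-vocab $f.vocab
>     ngram -order 3 -lm BIGLM.bin -vocab $f.vocab -limit-vocab \
>         -ppl $f -debug 2 > $f.probs
> done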
>
> If LM load time is still a limiting factor with this approach, you should
> use an LM server (see the ngram -use-server option), which effectively
> means you load the LM into memory only once.
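>
> For example (the port number and host are placeholders; the server side
> is started with the ngram -server-port option):
>
> ngram -order 3 -lm BIGLM.bin -server-port 8888 &
> ngram -use-server 8888@localhost -ppl DATA -debug 2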
>
> I suggest you join the srilm-user list and direct future questions there.
>
> Andreas
>
>
> Ryan Roth wrote:
>
>> Hello:
>>
>> My name is Ryan Roth, and I work at Columbia University's Center for
>> Computational Learning Systems.  My research currently focuses on Arabic
>> natural language processing.
>>
>> I have a question about the SRILM toolkit that I hope you'll be able to
>> help me with.
>>
>> My problem is the following.  I have a large N-gram LM file (non-binary)
>> that I built from a collection of about 200 million words.  I want to be
>> able to read a given input text file (containing one sentence per line)
>> and, for every N-gram found there, extract that N-gram's probability from
>> the LM file.
>>
>> Currently, I solve this problem by reading the entire LM file into memory
>> first, then reading the N-grams from the input text file and looking each
>> one up in the in-memory structure to get its probability.  This works, but
>> it is very slow and memory-intensive.  I can reduce the memory issues by
>> reading the input text file into memory instead and reading the LM file
>> line by line, but this is somewhat less convenient given the other
>> processing I need to perform on the input file.
>>
>> I've looked through the SRILM toolkit, and another option would seem to
>> be filtering the large LM file first, using the "make-lm-subset" script
>> and a counts file built from the input text file.  I would then use the
>> filtered output LM in place of the larger LM and proceed as before.  This
>> method would avoid the large memory requirements.  My initial tests,
>> however, show that the filtering step is still a bit slower than I'd like.
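>>
>> Roughly, the pipeline I tested was the following (file names are
>> placeholders, and the make-lm-subset argument order should be checked
>> against the lm-scripts man page):
>>
>> ngram-count -text INPUT -write INPUT.counts
>> make-lm-subset INPUT.counts BIGLM > SUB.lm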
>>
>> I was wondering if there is another, more time-efficient way of solving
>> this particular problem (that is, extracting a specific subset of N-gram
>> probabilities from a large LM file) using the other tools in the SRILM
>> toolkit.  Is there some option combination for "ngram", for example, that
>> would work? I don't currently see a direct solution.
>>
>>
>> Thank you very much,
>>
>> Ryan Roth
>> CCLS
>> Columbia University
>>
>>
>