[SRILM User List] Question Regarding SRILM N-gram tools

Andreas Stolcke stolcke at speech.sri.com
Tue Aug 24 14:49:30 PDT 2010


Ryan,

I suggest you use the -limit-vocab option with ngram, and write out 
your LM in binary format.
Reading a binary LM with -limit-vocab is very efficient, because only 
the portions of the LM parameters that pertain to your test-set 
vocabulary are processed.
You can generate the vocabulary used by your test data using

ngram-count -text DATA -write-vocab VOCAB
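
Once you have VOCAB, something like the following loads only the 
relevant parts of the LM and prints per-N-gram log probabilities 
(a sketch; BIG.LM and DATA are placeholder file names):

ngram -lm BIG.LM -limit-vocab -vocab VOCAB -ppl DATA -debug 2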

There is a tradeoff between processing small batches of data (hence 
small vocabularies, hence fast loading of the LM) and large batches 
(larger vocabularies, but the LM is loaded fewer times), so you might 
want to tune the batch size empirically for best overall throughput.
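
A hypothetical batching loop might look like this (a sketch only; the 
1000-line batch size and the file names are assumptions to be tuned):

split -l 1000 DATA batch.                        # split test data into batches
for b in batch.*; do
    ngram-count -text $b -write-vocab $b.vocab   # per-batch vocabulary
    ngram -lm BIG.LM -limit-vocab -vocab $b.vocab -ppl $b -debug 2 > $b.scores
done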

If LM load time is still a limiting factor with this approach, you 
should use an LM server (see the ngram -use-server option), which 
effectively means you load the LM into memory only once.
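
Roughly, the setup looks like this (a sketch; the port number and host 
name are placeholders, and -server-port is the companion option that 
starts the server side, per ngram(1)):

ngram -lm BIG.LM -server-port 2525                  # server: LM stays resident
ngram -use-server 2525@lmhost -ppl DATA -debug 2    # clients query over the network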

I suggest you join the srilm-user list and direct future questions there.

Andreas

Ryan Roth wrote:
> Hello:
>
> My name is Ryan Roth and I work at Columbia University's Center for 
> Computational Learning Systems. My research focus currently is on 
> Arabic Natural Language Processing.
>
> I have a question about the SRILM toolkit that I hope you'll be able to 
> help me with.
>
> My problem is the following.  I have a large N-gram LM file 
> (non-binary) that I built from a collection of about 200 million 
> words.  I want to be able to read a given input text file (containing one 
> sentence per line), and for every N-gram that I find there, extract 
> the probability for that N-gram from the LM file.
>
> Currently, I am solving this problem by reading the entire LM file 
> into memory first, and then reading the N-grams from the input text 
> file and referencing the memory structure to get the probability for 
> that N-gram.  This works fine, but is very slow and memory intensive.  
> I can reduce the memory issues by reading the input text file into 
> memory instead, and reading the LM file line-by-line, but this is 
> somewhat less convenient due to the other processing I need to perform 
> on the input file.
>
> I've looked through the SRILM toolkit, and another option would seem 
> to be to filter the large LM file first using the "make-lm-subset" 
> script and a counts file built from the input text file. I would then use 
> the filtered output LM in place of the larger LM and proceed as 
> before.  This method would seem to avoid the large memory 
> requirements.  My initial tests, however, show that the filtering step 
> is still a bit slower than I'd like.
>
> I was wondering if there is another, more time-efficient way of 
> solving this particular problem (that is, extracting a specific subset 
> of N-gram probabilities from a large LM file) using the other tools in 
> the SRILM toolkit.  Is there some option combination for "ngram", for 
> example, that would work? I don't currently see a direct solution.
>
>
> Thank you very much,
>
> Ryan Roth
> CCLS
> Columbia University
>
