Google language model

Andreas Stolcke stolcke at speech.sri.com
Tue Feb 6 17:43:50 PST 2007


In message <200702062003.l16K33Jk028807 at linus.mitre.org> you wrote:
> Hi Andreas,
> 
> I have been using SRILM for some time now and am interested in using it
> in conjunction with the Google language model.
> 
> From looking at the documentation and code, I can see that it reads the
> format, but do not see strategies to keep portions of the model in
> memory and others on disk, for example.  Obviously one would need to do
> something like this to hold the entire model.  However, I've also used
> and tweaked enough of the code to know you're a serious hacker, and that
> I might have missed something.
> 
> One thought I had was to point ngram-count to the Google LM, then use a
> word list to filter only the n-grams that I need SRILM to estimate
> probabilities for.  Beyond that, I'm stumped.
> 
> So, can you offer any feedback?  What are some strategies you recommend
> for using the Google LM?  

The Google LM (with nontrivial data size) is really meant to be used 
in conjunction with the -limit-vocab option, which restricts loading 
of parameters to a subset of the vocabulary (i.e., the subset used in your
test or tuning data).

An example of this appears in
$SRILM/test/tests/ngram-count-lm-limit-vocab/run-test.

BTW, there is no "Google LM" per se in SRILM.  You use the "CountLM" class,
and designate the counts to be read in Google format.
See the -count-lm option as described in the ngram(1) man page.
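Putting the two pieces together, an invocation along these lines is one way
to score test data against a count-LM while restricting parameter loading to
the test vocabulary.  This is a sketch, not a verbatim recipe: the file names
(google.countlm, test.vocab, test.txt) are placeholders, and the count-LM
descriptor file must be set up per the ngram(1) man page.

```shell
# Sketch only -- file names are hypothetical placeholders.
# google.countlm : count-LM descriptor pointing at the Google n-gram data
# test.vocab     : word list covering the test/tuning data
# test.txt       : text to be scored
ngram -count-lm -lm google.countlm \
      -limit-vocab -vocab test.vocab \
      -ppl test.txt
```

The -limit-vocab flag is what keeps memory use manageable: only n-gram
parameters involving words in test.vocab are loaded.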

Hope this clarifies things.

Andreas 



