Google language model
stolcke at speech.sri.com
Tue Feb 6 17:43:50 PST 2007
In message <200702062003.l16K33Jk028807 at linus.mitre.org> you wrote:
> Hi Andreas,
> I have been using SRILM for some time now and am interested in using it
> in conjunction with the Google language model.
> From looking at the documentation and code, I can see that it reads the
> format, but I do not see strategies to keep portions of the model in
> memory and others on disk, for example. Obviously one would need to do
> something like this to hold the entire model. However, I've also used
> and tweaked enough of the code to know you're a serious hacker, and that
> I might have missed something.
> One thought I had was to point ngram-count to the Google LM, then use a
> word list to filter only the n-grams that I need SRILM to estimate
> probabilities for. Beyond that, I'm stumped.
> So, can you offer any feedback? What are some strategies you recommend
> for using the Google LM?
The Google LM (with nontrivial data size) is really meant to be used
in conjunction with the -limit-vocab option, which restricts loading
of parameters to a subset of the vocabulary (i.e., the subset used in your
test or tuning data).
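As a rough sketch of that workflow (file names are hypothetical, and it assumes the SRILM tools are on your PATH), you would first collect the vocabulary actually used in your test data and then tell ngram to load only the matching parameters:

```
# Hypothetical file names; assumes SRILM binaries are installed.
# 1. Extract the vocabulary occurring in the test data.
ngram-count -text test.txt -write-vocab test.vocab
# 2. Load only parameters for words in that vocabulary.
ngram -lm big.lm -vocab test.vocab -limit-vocab -ppl test.txt
```

With -limit-vocab, n-gram parameters involving words outside the supplied vocabulary are skipped at load time, which is what makes a very large model usable in bounded memory.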
An example of this appears in
BTW, there is no "Google LM" per se in SRILM. You use the "CountLM" class,
and designate the counts to be read in Google format.
See the -count-lm option as described in ngram(1) man page.
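For illustration only (the directives shown are a partial sketch; consult ngram(1) for the authoritative format, including the mixture-weight directives omitted here), a count-LM specification file pointing at counts in Google format might look roughly like:

```
order 5
vocabsize 13588391
totalcount 1024908267229
google-counts /data/google-ngrams
```

It would then be evaluated with something like: ngram -count-lm -lm google.countlm -limit-vocab -vocab test.vocab -ppl test.txt (paths and file names again hypothetical).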
Hope this clarifies things.