Google language model
Andreas Stolcke
stolcke at speech.sri.com
Tue Feb 6 17:43:50 PST 2007
In message <200702062003.l16K33Jk028807 at linus.mitre.org>, you wrote:
> Hi Andreas,
>
> I have been using SRILM for some time now and am interested in using it
> in conjunction with the Google language model.
>
> From looking at the documentation and code, I can see that it reads the
> format, but I do not see strategies to keep portions of the model in
> memory and others on disk, for example. Obviously one would need to do
> something like this to hold the entire model. However, I've also used
> and tweaked enough of the code to know you're a serious hacker, and that
> I might have missed something.
>
> One thought I had was to point ngram-count to the Google LM, then use a
> word list to filter only the n-grams that I need SRILM to estimate
> probabilities for. Beyond that, I'm stumped.
>
> So, can you offer any feedback? What are some strategies you recommend
> for using the Google LM?
The Google LM (with nontrivial data size) is really meant to be used
in conjunction with the -limit-vocab option, which restricts loading
of parameters to a subset of the vocabulary (i.e., the subset used in your
test or tuning data).
An example of this appears in
$SRILM/test/tests/ngram-count-lm-limit-vocab/run-test.
BTW, there is no "Google LM" per se in SRILM. You use the "CountLM" class
and designate the counts to be read in Google format.
See the -count-lm option as described in the ngram(1) man page.
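For illustration, a minimal invocation might look like the sketch below. The file names (google.countlm, test.vocab, test.txt) are hypothetical; the count-LM parameter file must already reference the Google n-gram count data, in the format described in ngram(1).

```shell
# Hypothetical file names for illustration only.
# google.countlm : CountLM parameter file pointing at the Google counts
# test.vocab     : word list covering the vocabulary of the test data
# -limit-vocab restricts parameter loading to the words in -vocab,
# so only the needed portion of the model is held in memory.
ngram -order 5 \
      -count-lm -lm google.countlm \
      -limit-vocab -vocab test.vocab \
      -ppl test.txt
```

The word list given to -vocab should be extracted from the test or tuning data beforehand, mirroring what the run-test script cited above does.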
Hope this clarifies things.
Andreas
More information about the SRILM-User mailing list