cache based models
j.ganitkevitch at googlemail.com
Thu Mar 8 01:11:32 PST 2007
> I can infer from your answers that to use cache model, I can either:
> 1- Use the subclass CacheLM using a programming language.
> 2- use the option -cache with the ngram command.
Actually, -cache uses the implementation given in the CacheLM class.
If you want to extend functionality, I figure your best bet would be
to extend either the CacheLM or the LM class (I don't think any
language other than C/C++ would be good here, as you'd get horrible
performance from invoking wrappers for every word).
You would then need to plug your class into ngram (possibly
ngram-count as well if you have stuff to count/train). This is
actually quite simple; you can best observe the necessary steps by
searching for cache in ngram.cc. You'll find essentially two parts:
one where command line parameters are defined and mapped to
variables, and a second where the model is instantiated and mixed
into the current model.
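
To illustrate what "mixed into the current model" amounts to, here is
a minimal standalone sketch of a unigram cache interpolated with a
base model. It is not SRILM code: UniformLM, CacheMixture, observe()
and the parameter names are all made up for this example, and the
real CacheLM may differ in its details (for instance in how an empty
cache is handled).

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for a base language model; NOT SRILM's LM class.
struct UniformLM {
    std::size_t vocabSize;
    double wordProb(const std::string&) const { return 1.0 / vocabSize; }
};

// Unigram cache over the last maxLen words, interpolated with a base model.
class CacheMixture {
public:
    CacheMixture(const UniformLM& base, std::size_t maxLen, double lambda)
        : base_(base), maxLen_(maxLen), lambda_(lambda) {
        assert(maxLen > 0 && lambda >= 0.0 && lambda <= 1.0);
    }

    // P(w) = lambda * count(w in cache) / |cache| + (1 - lambda) * P_base(w)
    // With an empty cache we simply fall back to the base model (an
    // assumption of this sketch).
    double wordProb(const std::string& w) const {
        double pBase = base_.wordProb(w);
        if (history_.empty()) return pBase;
        auto it = counts_.find(w);
        double pCache = (it == counts_.end())
            ? 0.0
            : static_cast<double>(it->second) / history_.size();
        return lambda_ * pCache + (1.0 - lambda_) * pBase;
    }

    // Add a word to the cache, evicting the oldest one when it is full.
    void observe(const std::string& w) {
        history_.push_back(w);
        ++counts_[w];
        if (history_.size() > maxLen_) {
            std::string old = history_.front();
            history_.pop_front();
            if (--counts_[old] == 0) counts_.erase(old);
        }
    }

private:
    UniformLM base_;
    std::size_t maxLen_;
    double lambda_;
    std::deque<std::string> history_;               // cached words, oldest first
    std::unordered_map<std::string, std::size_t> counts_;  // counts within cache
};
```

In use, you would call wordProb() for each word of the test text and
then observe() the same word, so that the cache only ever contains
words that precede the one being predicted.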
> I still prefer to master the existing commands before using any API,
> so now, suppose I want to use ngram -cache 10
To my knowledge (this would vary with texts and languages, of course),
a value of 100 is a good starting point.
> and I would like to define word classes,
> The pdf paper says that "Word classes may be defined manually". I
> would like to know how to do that, and how to pass the classes file to
Given the current code, I figure you'll need to implement your own
cache model, as this one does not incorporate any kind of word class
support. Either you map words to classes (and operate on those) in
your model, or you have an LM wrapper (a bit like the classes that
provide for combining LMs) that feeds the cache model with classes
rather than words. Sadly I don't know if there is such an approach
implemented in SRILM.
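
As a rough sketch of the first option (mapping words to classes
inside the model), something along these lines could work. Again
this is standalone illustration code, not SRILM API: ClassMap,
ClassCache and their methods are hypothetical names, and the
word-to-class table is assumed to have been loaded from your manually
written classes file by some other code.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical word-to-class table, e.g. filled from a hand-written file.
using ClassMap = std::unordered_map<std::string, std::string>;

// Cache that stores the classes of recent words rather than the words.
class ClassCache {
public:
    ClassCache(ClassMap classes, std::size_t maxLen)
        : classes_(std::move(classes)), maxLen_(maxLen) {
        assert(maxLen > 0);
    }

    // Words without an entry are treated as their own singleton class.
    std::string classOf(const std::string& w) const {
        auto it = classes_.find(w);
        return it == classes_.end() ? w : it->second;
    }

    // Record the class of w, evicting the oldest entry when full.
    void observe(const std::string& w) {
        std::string c = classOf(w);
        history_.push_back(c);
        ++counts_[c];
        if (history_.size() > maxLen_) {
            std::string old = history_.front();
            history_.pop_front();
            if (--counts_[old] == 0) counts_.erase(old);
        }
    }

    // Relative frequency of w's class in the cache (0 if the cache is empty).
    double classProb(const std::string& w) const {
        if (history_.empty()) return 0.0;
        auto it = counts_.find(classOf(w));
        return it == counts_.end()
            ? 0.0
            : static_cast<double>(it->second) / history_.size();
    }

private:
    ClassMap classes_;
    std::size_t maxLen_;
    std::deque<std::string> history_;               // cached classes, oldest first
    std::unordered_map<std::string, std::size_t> counts_;  // counts within cache
};
```

Note that classProb() only gives you the probability of the class; to
get a word probability you would still have to multiply by some
estimate of P(word | class), which this sketch leaves out.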
Documentation is a bit sparse, true. As long as you don't want to
code around in SRILM, the manpages and -help options provide you with
a bit of an overview.
For coding, I have found it helpful to follow the course of main()
in either ngram or ngram-count to figure out how it works. The code
is clean and the naming gives you a good insight into what's going on.