behaviour of class models

Wed May 7 08:01:59 PDT 2008

Hi,

I recently ran into some unexpected behaviour of class-based models in SRILM. It may be useful for other people who work with such models for inflectional languages to know about it.

I am currently working with linguistically motivated classes. What I have is, for example, a stem-based model: stems are treated as classes for wordforms, as encoded in a class definition file that I generate myself, which is rather straightforward.
When I calculate perplexity with that model, both on its own and in interpolation with the conventional word LM, it comes out much lower than I would expect for the given data and vocabulary. At the same time, some stems (which serve as classes) coincide with some of the wordforms, which is natural. So I suspect the unexpected numbers arise because ngram with the -classes option can treat a class LM file as consisting of both class markers and wordforms (if some entry is not listed among the classes). That probably distorts the results in my case. After I added a postfix to each stem in both the class definition file and the LM file, which guarantees that no stem coincides with a wordform, the perplexity results became much more realistic.
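
For illustration, here is a minimal Python sketch of the check and the renaming on the class definition side. It is not the script I actually used. It assumes each line of the class definition file has the form "class [p] word1 word2 ...", with an optional expansion probability p, and that a numeric second field followed by at least one more field is that probability. The function names and the "_STEM" postfix are made up for the example, and the same renaming would still have to be applied to the class tokens in the LM file, which this sketch does not do.

```python
def parse_class_line(line):
    """Split one class definition line into (class_name, prob, words).

    Assumed layout: class [p] word1 word2 ...
    Returns None for blank lines; prob is None when no probability is given.
    """
    fields = line.split()
    if not fields:
        return None
    cls, rest = fields[0], fields[1:]
    prob = None
    # Heuristic (an assumption): a numeric field followed by at least
    # one more field is taken to be the expansion probability.
    if len(rest) >= 2:
        try:
            float(rest[0])
            prob, rest = rest[0], rest[1:]
        except ValueError:
            pass
    return cls, prob, rest


def find_collisions(lines):
    """Return the set of class names that also occur as wordforms."""
    classes, words = set(), set()
    for line in lines:
        parsed = parse_class_line(line)
        if parsed is None:
            continue
        cls, _prob, members = parsed
        classes.add(cls)
        words.update(members)
    return classes & words


def add_postfix(lines, postfix="_STEM"):
    """Return new class definition lines with the postfix on every class name."""
    out = []
    for line in lines:
        parsed = parse_class_line(line)
        if parsed is None:
            continue
        cls, prob, members = parsed
        fields = [cls + postfix] + ([prob] if prob is not None else []) + members
        out.append(" ".join(fields))
    return out


if __name__ == "__main__":
    # Toy data: the stem "walk" is also a wordform, "talk" is not.
    example = [
        "walk 0.5 walk",
        "walk 0.5 walked",
        "talk talks talking",
    ]
    print(sorted(find_collisions(example)))
    for renamed in add_postfix(example):
        print(renamed)
```

Running find_collisions before and after the renaming is a quick way to confirm that no class name is left that could be read as a wordform.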

best regards,
Ilya
