behaviour of class models

Andreas Stolcke stolcke at speech.sri.com
Wed May 7 10:14:15 PDT 2008


In message <67936.54080.qm at web25405.mail.ukl.yahoo.com> you wrote:
> Hi,
> 
> I have recently found some unexpected behaviour of class-based models in
> SRILM. It will probably be useful for other people who also deal with
> such models for inflectional languages to know about this.
> 
> I am now working with linguistically motivated classes. What I have is,
> for example, a stem-based model. That is, stems are regarded as classes
> for wordforms, as encoded in a class definition file that I generate
> myself; that's rather straightforward.
> Then, when I calculate perplexity with that model, interpolated with a
> conventional word LM, it comes out much lower than I expect for the
> given data and vocabulary. At the same time, some stems (which serve as
> classes) coincide with some of the wordforms (which is natural). So I
> had the feeling the unexpected numbers result from the fact that SRILM's
> ngram with the -classes option can treat a class LM file as consisting
> of both class labels and wordforms (if some entry is not listed among
> the classes). That probably screws up the results in my case. After I
> added a postfix to each stem in both the class definition file and the
> LM file, guaranteeing that no stems coincide with wordforms, the
> perplexity results became much more realistic.

To clarify:

A class-based LM can contain N-grams that mix word and class labels.
For example, it might contain the N-gram "a B c D" where the lower-case
tokens are words and the upper-case tokens are classes.
However, you still need to keep words and classes separate: "B" should
not occur both as a word and as a class label (that is, on both the
left-hand and right-hand sides of the entries in a class definitions
file).
So you need some kind of spelling convention that distinguishes words
from classes wherever there are conflicts.
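As a sketch of such a convention (the file names and the "__C" postfix
here are just illustrative choices, not anything SRILM prescribes), each
line of a class definitions file gives one class label, an optional
expansion probability, and the word sequence it expands to; appending a
postfix to every class label guarantees it can never collide with a
wordform:

```
# classes.defs -- one expansion per line: CLASS [prob] word-sequence
# "walk" also exists as a wordform, so the class is named walk__C
walk__C 0.5 walk
walk__C 0.3 walks
walk__C 0.2 walked
```

The class-based LM would then use "walk__C" in its N-grams wherever the
class is meant, and perplexity can be computed with the renamed
definitions file, e.g.:

```
ngram -order 3 -lm class.lm -classes classes.defs -ppl test.txt
```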

Andreas 



