Class based 3-gram in SRILM

Andreas Stolcke stolcke at speech.sri.com
Fri Feb 6 13:08:33 PST 2004


In message <200402061514.i16FEIg4005091 at www4.pobox.sk> you wrote:
> Hi!
>  I have the following problem. I've estimated a class-based bigram model
> (with some defined words excluded from the clustering process) using
> the ngram-class tool. But I want to use a class-based trigram model.
> How to get class-based trigram counts and probabilities using SRILM?

Use the "replace-words-with-classes" script to apply the class definitions
to your training data.  Then train a trigram LM on the resulting text in the
usual way.  See training-scripts(1).
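
To make the idea concrete, here is a minimal Python sketch of what the
replacement step does conceptually.  It is not the SRILM script itself: the
function name, the class labels and the word-to-class table are all made up
for illustration, and words without a class are simply left as they are.

```python
def replace_words_with_classes(line, word_to_class):
    """Map each token to its class label; tokens without a class stay as words."""
    return " ".join(word_to_class.get(tok, tok) for tok in line.split())

# Hypothetical class assignments; in practice they would come from the
# class definitions produced by the clustering step.
word_to_class = {"monday": "CLASS-1", "tuesday": "CLASS-1", "paris": "CLASS-2"}

corpus = ["see you monday in paris", "the meeting is tuesday"]
class_corpus = [replace_words_with_classes(s, word_to_class) for s in corpus]

for sentence in class_corpus:
    print(sentence)
```

The output is ordinary text in which class labels take the place of the
clustered words, so any trigram trainer can count over it as if the labels
were words.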

> 
>  I also want to ask whether anyone knows a freely available tool for
> word clustering using trigram counts? And is it possible to create a
> class language model based on POS-tags in SRILM?

I don't know of any available implementation of trigram-based word
clustering, and it would be quite expensive (slow) to do.
I believe some work by Philips/Aachen researchers showed that the
improvement over bigram-induced classes (in a higher-order class-based LM)
is pretty small.  In any case, bigram-induced classes are what almost
everybody uses these days.

As for POS-based LMs, all you need to do is run a tagger (and there are many
out there) over your training data.  Then you use the tagged data to
train a tag-n-gram model in the usual way.  (You can also estimate the
class-membership probabilities from the tagging results.)
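
As a rough illustration of the last two steps, the sketch below takes
already-tagged sentences, extracts the tag sequences (the text a tag-n-gram
model would be trained on), and estimates p(word | tag) by relative
frequency.  The tag names, the toy data and the function name are
assumptions for the example, and no smoothing is applied.

```python
from collections import Counter, defaultdict

def membership_probs(tagged_sentences):
    """Relative-frequency estimate of p(word | tag) from (word, tag) pairs."""
    pair_counts = Counter()
    tag_counts = Counter()
    for sent in tagged_sentences:
        for word, tag in sent:
            pair_counts[(word, tag)] += 1
            tag_counts[tag] += 1
    probs = defaultdict(dict)
    for (word, tag), count in pair_counts.items():
        probs[tag][word] = count / tag_counts[tag]
    return dict(probs)

# Toy tagger output (hypothetical); each sentence is a list of (word, tag) pairs.
tagged = [
    [("the", "DT"), ("dog", "NN"), ("barks", "VBZ")],
    [("the", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    [("a", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
]

# Tag-only text, one sentence per line, for training the tag n-gram model.
tag_sequences = [" ".join(tag for _, tag in sent) for sent in tagged]

probs = membership_probs(tagged)
```

Here each tag plays the role of a class, and the probabilities for the words
of any one tag sum to one.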

You could use the disambig tool to do the POS tagging itself, but since it
doesn't deal with morphological and other non-n-gram cues (e.g.,
to handle unknown words) it won't be competitive with state-of-the-art taggers.

--Andreas 



