Class-based LM using the SRILM toolkit?

Tue Apr 24 10:29:37 PDT 2007

In message <d4929ad00704181101t30f6d973s986f692b0010e2ca at mail.gmail.com> you wrote:
> Dear Dr. Stolcke,
> 
> Thank you for your attention.
> 
> Is there no way to construct a class-based LM by pre-defining the
> classes to be used (vis-a-vis inducing them)? The class-format man
> page does mention how classes may be defined by hand, but this format
> requires the specification of the class expansion probabilities as
> well. Can these probabilities be calculated by a program in the
> toolkit? Correct me if I'm wrong, but these probabilities are given by
> (for a certain word wi, and class ci) : Number of times wi occurs in
> class ci/Number of times words in class ci occur.

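You can

(1) define your classes by hand, using dummy probabilities, and then
(2) run the replace-words-with-classes script with the options
		outfile=FILE normalize=1
    on some training data.  FILE will then contain the class definitions
    with expansion probabilities estimated from that data.  This is
    documented in the training-scripts(5) man page.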
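
For illustration only, here is a minimal Python sketch of the relative-frequency estimate described in the question: count(word in class) divided by count(all words in that class). It is not the toolkit's implementation. The function name, the toy classes, and the toy text are hypothetical, and the sketch assumes each word belongs to exactly one class.

```python
from collections import Counter, defaultdict

def class_expansion_probs(tokens, word_to_class):
    """Estimate P(word | class) = count(word) / count(all words in the class).

    Assumption: every word maps to at most one class.
    """
    # Count only the tokens that are members of some hand-defined class.
    word_counts = Counter(t for t in tokens if t in word_to_class)

    # Sum the member counts per class to get the denominators.
    class_totals = defaultdict(int)
    for word, n in word_counts.items():
        class_totals[word_to_class[word]] += n

    # Normalize each word count by the total for its class.
    return {
        (word_to_class[word], word): n / class_totals[word_to_class[word]]
        for word, n in word_counts.items()
    }

# Hypothetical hand-defined classes and toy training text.
word_to_class = {"monday": "DAY", "tuesday": "DAY", "boston": "CITY"}
tokens = "fly to boston on monday or tuesday or monday".split()

probs = class_expansion_probs(tokens, word_to_class)
for (cls, word), p in sorted(probs.items()):
    print(cls, p, word)
```

Class members that never occur in the training text get no entry here; whether and how to smooth such cases is left open in this sketch.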

> Also, is the file that is generated by the ngram-class -class-counts
> option in the same format as class-format? Can a file in the
> class-format format be used directly by the ngram-count program to
> learn a class-based LM?

The -class-counts output is in the right format to be used as a count
input file for ngram-count to estimate a bigram LM for the class labels.
However, this will only work for bigram LMs since ngram-class doesn't
use higher-order statistics.  The recommended procedure is to
again use the replace-words-with-classes command to insert class
labels in your LM training data, and then use ngram-count on
the transformed data to estimate the class ngram probabilities.
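
As a rough picture of what "inserting class labels" means, the following Python sketch maps each class member in a line of training text to its class label and leaves other words unchanged. It is only a conceptual stand-in, not the replace-words-with-classes script itself: the function name and the toy data are hypothetical, and it assumes single-word class members with one class per word.

```python
def replace_words_with_labels(line, word_to_class):
    """Replace each class member in a whitespace-tokenized line by its class label.

    Assumption: single-word members only, one class per word.
    """
    return " ".join(word_to_class.get(tok, tok) for tok in line.split())

# Hypothetical hand-defined classes and toy training sentences.
word_to_class = {"monday": "DAY", "tuesday": "DAY", "boston": "CITY"}
training_lines = [
    "fly to boston on monday",
    "return on tuesday",
]

transformed = [replace_words_with_labels(l, word_to_class) for l in training_lines]
for l in transformed:
    print(l)
```

The transformed lines are the kind of text on which N-gram counts over class labels, of any order, can then be collected.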

Andreas