Class Language Modelling

Andreas Stolcke stolcke at speech.sri.com
Tue Nov 19 20:20:01 PST 2002


In message <Pine.GSO.4.21.0211191112300.13692-100000 at c06.clsp.jhu.edu> you wrote:
> 
> Suppose I wish to build a language model P(w0 | CW0, CW1, CW2), where CW0,
> CW1, and CW2 are the equivalence classes for the predicted word and the two
> preceding words respectively, and I wish to use absolute discounting with a
> fixed D. The input files I have available are (1) a trigram count file
> (format: w0 w1 w2 count), (2) a vocab file, and (3) three class files
> (format: classno word1 word2 ...) for the w0, w1, and w2 positions.
> Can someone please tell me the syntax of the ngram-count command needed to
> build an LM using this sort of class LM, as I am not sure I understand it
> clearly.
> Thanks,
> Geetu

Geetu,

SRILM does not currently support class LMs with separate class membership
functions for the different positions in an N-gram.  All word positions
must share the same class definitions.

Under these constraints, we typically train a class LM as follows:

1. Prepare a class definition file in the format described in the
   classes-format(5) manual page.  This can be done by hand, from other
   knowledge sources, or automatically using word clustering algorithms
   (see ngram-class(1)).

   It is a bad idea to use plain numbers as class names; when in doubt,
   use names like CLASS1, CLASS2, etc.  This avoids confusion in places
   where a field can be either a class name, a word, or an integer count.
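
   For concreteness, a small class definitions file might look like the
   following (class names, words, and probabilities are all made up, and
   the per-expansion probabilities are optional; see classes-format(5)
   for the exact syntax):

	CLASS1 0.5 monday
	CLASS1 0.5 tuesday
	CLASS2 0.9 new york
	CLASS2 0.1 boston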

2. Process the training data or counts to replace words with class labels,
   using the "replace-words-with-classes" filter (see the training-scripts(1)
   man page).
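
   For example, something like this (CLASSES and TEXT are placeholder file
   names; see training-scripts(1) for the full set of options the script
   accepts):

	replace-words-with-classes classes=CLASSES TEXT > TEXT.classes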

3. Run ngram-count on the result of step 2.
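
   For absolute discounting with a fixed discount D, as you describe, this
   step might look something like the following (file names are placeholders,
   and D stands for a numeric discount value):

	ngram-count -text TEXT.classes -lm CLASS.lm \
		-cdiscount1 D -cdiscount2 D -cdiscount3 D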

Although multiple class definitions for different word positions are not
supported by the above training procedure or by the LM evaluation code,
there is a fairly straightforward way to fake it.  I am assuming here that
classes expand to exactly one word at a time, and that a word has a unique
class in a given N-gram position.

You need to write a filter that maps word N-gram counts to class N-gram
counts (w1 w2 w3 N -> c1 c2 c3 N, and similarly for unigrams and bigrams).
Then you can train and evaluate your class LM by operating on counts
rather than text.
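
Here is a rough sketch of what such a filter might look like.  This is not
part of SRILM but a hypothetical Python script; it assumes three class files
in your "classno word1 word2 ..." format, given as command line arguments for
the w0, w1, and w2 positions, and it follows the SRILM convention that the
rightmost word of an N-gram is the predicted one:

#!/usr/bin/env python
"""Map word N-gram counts on stdin to class N-gram counts on stdout.

Hypothetical sketch, not part of SRILM.  Usage:

    word-to-class-filter CLASSES.w0 CLASSES.w1 CLASSES.w2

where each argument is a class file in the format
"classno word1 word2 ...", for the predicted-word, w1, and w2
positions respectively.
"""

import sys

def read_classes(path):
    """Build a word -> class mapping from one class file."""
    mapping = {}
    with open(path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:
                continue
            for word in fields[1:]:
                mapping[word] = fields[0]
    return mapping

# One membership table per position, indexed by distance from the
# predicted (rightmost) word: 0 = w0, 1 = w1, 2 = w2.
maps = [read_classes(path) for path in sys.argv[1:4]]

counts = {}
for line in sys.stdin:
    fields = line.split()
    if len(fields) < 2:
        continue
    words, n = fields[:-1], int(fields[-1])
    labels = []
    for i, word in enumerate(words):
        dist = len(words) - 1 - i
        # Words without a class (e.g. <s> and </s>) pass through as-is.
        labels.append(maps[dist].get(word, word))
    key = tuple(labels)
    counts[key] = counts.get(key, 0) + n

for key in sorted(counts):
    print(" ".join(key) + "\t" + str(counts[key]))
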
To train:

	ngram-count -text DATA -write - | word-to-class-filter | \
	ngram-count -read - -lm LM [smoothing-options]

Similarly, you can map the test data to counts, filter them, and use the
ngram -counts option to compute perplexities and log probabilities from
counts.
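
Concretely, the evaluation might look something like this
(word-to-class-filter is the hypothetical script sketched above; if your
SRILM version does not accept "-" as a file name here, write the filtered
counts to a temporary file instead):

	ngram-count -text TESTDATA -write - | word-to-class-filter | \
	ngram -lm LM -counts -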

There is one detail in LM estimation: you need to prevent class labels that
can only occur in the history portion of an N-gram from receiving backoff
probability mass as a result of smoothing.  You can accomplish that by
listing those not-to-be-predicted classes in a file and specifying them
with the ngram-count -nonevents option; see the man page for details.
You also need to keep track of the probabilities incurred by replacing
each word in the test set with its class (the filter script could do that
as a side effect), and add the log probability of the class expansions to
the log probability of the class N-grams.
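
For example, if the history-only class labels are listed one per line in a
file NONEVENTS (a placeholder name), the training step from counts might
become:

	ngram-count -read CLASSCOUNTS -nonevents NONEVENTS -lm LM \
		[smoothing-options]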

Hope this helps,

--Andreas



