Class Language Modelling
Andreas Stolcke
stolcke at speech.sri.com
Tue Nov 19 20:20:01 PST 2002
In message <Pine.GSO.4.21.0211191112300.13692-100000 at c06.clsp.jhu.edu> you wrote:
>
> Suppose I wish to build a language model P(w0 | CW0, CW1, CW2), where CW0, CW1,
> and CW2 are the equivalence classes for the predicted word and the 2
> preceding words respectively, and I wish to use absolute discounting with a
> fixed D. The input files I have available are (1) a trigram count file
> (format: w0 w1 w2 count), (2) a vocab file, and (3) 3 class files (format:
> classno word1 word2 ...) for the w0, w1, and w2 positions.
> Can someone please tell me the syntax of the ngram-count command needed to
> build an LM using this sort of class LM, as I am not very sure I
> understand it clearly.
> Thanks,
> Geetu
Geetu,
SRILM does not currently support class LMs with separate class membership
functions for the different positions in an N-gram. All word positions
must share the same class definitions.
Under this constraint, we typically train a class LM as follows:
1. Prepare a class definition file in the format described in the
classes-format(5) manual page. This can be done by hand, from other
knowledge sources, or automatically using word clustering algorithms
(see ngram-class(1)).
It is a bad idea to use plain numbers as class names. When in doubt,
use names like CLASS1, CLASS2, etc. This avoids confusion in places where
a field could be either a class name, a word, or an integer count.
2. Condition the training data or counts to replace words with class labels,
using the "replace-words-with-classes" filter (see the training-scripts(1)
man page).
3. Run ngram-count on the result of step 2. (A sketch of the whole procedure
follows below.)
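For concreteness, here is a minimal sketch of steps 1-3; the class names, file
names, and the discount value are made up, so check classes-format(5),
training-scripts(1), and ngram-count(1) for the exact formats and options.
A class definition file, say classes.defs, has one expansion per line, with an
optional expansion probability after the class name:

    COLOR 0.5 red
    COLOR 0.5 blue
    DAY 0.5 monday
    DAY 0.5 tuesday

Steps 2 and 3 might then look like this; since you asked about absolute
discounting with a fixed D, the -cdiscountN options are one way to get that:

    replace-words-with-classes classes=classes.defs train.txt > train-classes.txt
    ngram-count -text train-classes.txt -order 3 \
        -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5 \
        -lm class.3bo.lm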
Although multiple class definitions for different word positions are not
supported by the above training procedure, or the LM evaluation code,
there is a fairly straightforward way to fake it.
I'm assuming now that classes expand to exactly one word at a time,
and that a word has a unique class in a given ngram position.
You need to write a filter that maps word N-gram counts to
class N-gram counts (w1 w2 w3 N -> c1 c2 c3 N, and similarly for unigrams and
bigrams). Then you can train and evaluate your class LM by operating on
counts rather than text.
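A minimal sketch of such a filter, written in Python just for illustration
(SRILM itself does not ship this script); it assumes class files in the
"classno word1 word2 ..." format from your question, one file per N-gram
position, and all file names are made up:

    #!/usr/bin/env python
    # word-to-class-filter: read word N-gram counts ("w1 ... wn N") on stdin
    # and write class N-gram counts ("c1 ... cn N") on stdout.
    import sys

    def read_classes(path):
        """Map each word to its (unique) class label for one position."""
        word2class = {}
        with open(path) as f:
            for line in f:
                fields = line.split()
                if fields:
                    for w in fields[1:]:
                        word2class[w] = fields[0]
        return word2class

    # One class map per position, given right to left on the command line:
    # argv[1] = classes for the predicted (last) word, argv[2] = previous word, etc.
    maps = [read_classes(p) for p in sys.argv[1:]]

    for line in sys.stdin:
        fields = line.split()
        if len(fields) < 2:
            continue
        words, count = fields[:-1], fields[-1]
        out = []
        for i, w in enumerate(words):
            pos = len(words) - 1 - i        # distance back from the predicted word
            m = maps[pos] if pos < len(maps) else maps[-1]
            out.append(m.get(w, w))         # leave <s>, </s>, unknown words alone
        print(" ".join(out + [count]))

In the pipelines below it would be invoked as, for example,
word-to-class-filter w0.classes w1.classes w2.classes.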
To train:

    ngram-count -text DATA -write - | word-to-class-filter | \
        ngram-count -read - -lm LM [smoothing-options]
Similarly, you can map the test data to counts, filter them, and use the
ngram -counts option to compute perplexities and log probabilities from
counts.
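For example (TESTDATA and the filter invocation are placeholders, following
the same pattern as the training pipeline above):

    ngram-count -text TESTDATA -write - | word-to-class-filter > test.classcounts
    ngram -lm LM -order 3 -counts test.classcounts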
There is one detail in LM estimation: you need to prevent class labels that
can only occur in the history portion of an N-gram from receiving backoff
probability mass as a result of smoothing. You can accomplish that
by listing those not-to-be-predicted classes in a file and specifying
them with the ngram-count -nonevents option; see the man page for
details. You also need to keep track of the probabilities incurred
by replacing each word with its class in the test set
(the filter script could do that as a side effect), and add the log
probability of the class expansions to the log probability of the
class N-grams.
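To illustrate the -nonevents part: if HISTCLASS1 and HISTCLASS2 (made-up
names) were classes that can only occur in histories, you would list them one
per line in a file, say nonevents.txt, and add it to the training command:

    ngram-count -read - -order 3 -nonevents nonevents.txt \
        -cdiscount1 0.5 -cdiscount2 0.5 -cdiscount3 0.5 -lm LM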
hope this helps,
--Andreas