class LM

Andreas Stolcke stolcke at
Fri Sep 20 12:48:34 PDT 2002

In message <3D8B0E33.3148460E at> you wrote:
> Hi all,
> I have a question about the class-based models. I have just started to
> use them.
> First I want to understand the test example in the toolkit.
> I have problems with understanding the probability computation of the
> devtest.text
> Can you please explain to me which 1-grams, 2-grams, 3-grams, ... are meant,
> for example, in this sentence:

The ClassNgram LM performs dynamic programming to compute the prefix
probabilities of sentences (and from that, the conditional word probabilities).
This is done because a given word can be "generated" by the LM either
as a plain word or as a member of one of several classes (and in the 
case of multi-word class expansions at different positions in the 
expansion).  So the states in the DP trellis correspond to the different
classes and the positions of the word in the expansion. (This is 
very similar to the notion of a "dotted item" in context-free 
parsing, in case you're familiar with that).
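The dynamic program can be sketched roughly as follows. This is a toy Python illustration, not SRILM's actual implementation (that lives in ClassNgram::prefixProb()); the class definitions and all N-gram probabilities are invented, and real class expansions also carry a membership probability that is omitted here for brevity:

```python
# Toy forward pass over the DP trellis: each word can be emitted either as
# a plain word or as the next word of some class expansion (a "dotted item").
# All probabilities are invented for illustration.

# Class definitions: class -> list of expansions (word sequences).
CLASSES = {"GRIDLABEL": [("kaybeck",), ("lost", "creek")]}

# Invented N-gram probabilities over words *and* class tokens.
NGRAM = {
    ("<s>", "kaybeck"): 1e-4,    # word bigram (reached via backoff, say)
    ("<s>", "GRIDLABEL"): 1e-3,  # class bigram
}

def step(states, word):
    """Advance the trellis by one word.
    states maps a state to its prefix probability.  A state is either a
    context token (the word or class just completed) or a triple
    (class, expansion, dot) marking a position inside a multi-word
    class expansion."""
    new = {}
    for state, p in states.items():
        if isinstance(state, tuple):          # inside a class expansion
            cls, exp, dot = state
            if exp[dot] == word:              # expansion continues
                if dot + 1 == len(exp):       # expansion finished
                    new[cls] = new.get(cls, 0.0) + p
                else:
                    nxt = (cls, exp, dot + 1)
                    new[nxt] = new.get(nxt, 0.0) + p
            continue
        ctx = state
        # Way 1: emit `word` as a plain word (one N-gram lookup).
        pw = NGRAM.get((ctx, word), 0.0)
        if pw:
            new[word] = new.get(word, 0.0) + p * pw
        # Way 2: start a class expansion beginning with `word`
        # (another N-gram lookup, this time on the class token).
        for cls, exps in CLASSES.items():
            pc = NGRAM.get((ctx, cls), 0.0)
            if not pc:
                continue
            for exp in exps:
                if exp[0] == word:
                    if len(exp) == 1:
                        new[cls] = new.get(cls, 0.0) + p * pc
                    else:
                        key = (cls, exp, 1)
                        new[key] = new.get(key, 0.0) + p * pc
    return new

states = {"<s>": 1.0}
states = step(states, "kaybeck")
# The prefix probability is the sum over all surviving trellis states:
print(sum(states.values()))
```

Note that consuming "kaybeck" here triggers two N-gram lookups, one on the word itself and one on the class token, which is exactly why traces like [1gram][2gram] show more than one lookup per word.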

As a result, many N-gram lookups are performed to go from one word to the
next: one for each state transition in the trellis.
So when you see something like

> kaybeck and lost ok
>  p( kaybeck | <s> )  = [1gram][2gram] 0.000845361 [ -3.07296 ] / 1

it means that both a unigram and a bigram lookup happened.  In this 
particular case this makes sense because p(kaybeck|<s>) is probably 
not among the bigrams (hence backoff to unigram), but 
p(GRIDLABEL | <s>) is a bigram (kaybeck is a member of class GRIDLABEL).
The probabilities of both cases are summed to obtain the total probability of
the word.
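As a toy illustration of that sum (all numbers invented, not the actual values from the test model):

```python
# Two ways to generate "kaybeck" after <s>; all probabilities invented.
bo_s = 0.5            # backoff weight bo(<s>), applied when the word bigram is missing
p_uni = 2e-4          # unigram p(kaybeck)
p_cls_bigram = 1e-3   # class bigram p(GRIDLABEL | <s>)
p_member = 0.8        # class membership p(kaybeck | GRIDLABEL)

word_path = bo_s * p_uni               # the [1gram] path: backoff to the unigram
class_path = p_cls_bigram * p_member   # the [2gram] path: via class GRIDLABEL

p_total = word_path + class_path       # both paths are summed
print(p_total)
```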

I believe you can set -debug 4 to trace the state transitions in the DP trellis.
It becomes a little unwieldy to follow as the number of states increases
the deeper you get into the sentence.
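For example, an invocation along these lines (the file names are placeholders for your model, class definitions, and test set) would print the per-word trace:

```shell
# -debug 4 traces the DP state transitions word by word
ngram -lm classmodel.lm -classes classes.defs -ppl devtest.text -debug 4
```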

>  p( and | kaybeck ...)  = [1gram][3gram] 0.443827 [ -0.352786 ] / 1
>  p( lost | and ...)  = [2gram][2gram][4gram][4gram] 0.0305452 [ -1.51506
> ] / 1
>  p( ok | lost ...)  = [3gram][3gram][4gram][4gram] 0.0703371 [ -1.15282
> ] / 0.999999
>  p( </s> | ok ...)  = [3gram][4gram] 0.401395 [ -0.396428 ] / 1
> I am familiar with the class model, where all words are mapped to
> classes.
> In this example, there are only two classes (GRIDLABEL and
> in the model we have ngrams of words and ngrams of words and classes.

We generalized the class model so that mixed N-grams of words and classes
are allowed for convenience.  This is however just equivalent to having 
an extra class for each word that contains only that word itself.
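A minimal sketch of that equivalence, with invented probabilities:

```python
# A plain word w in a mixed word/class N-gram behaves like a singleton
# class C_w = {w} with membership probability 1.  Numbers are invented.
p_word_ngram = 0.4   # p(and | kaybeck) in the mixed word/class model
p_class_ngram = 0.4  # p(C_and | kaybeck) in an all-classes model
p_membership = 1.0   # p(and | C_and): the class contains only "and"

# The two models assign the same probability to the word:
equal = (p_class_ngram * p_membership == p_word_ngram)
print(equal)
```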

> I understand the idea that if an n-gram of words exists it is better to
> use it, and if not, the classes should help.

As I explained above, it's not an either-or.  You compute probabilities
for all the ways of generating a word, and sum. 

> But what are the steps in probability computation?

Again, tracing the DP with the -debug option will give you a sense of 
the details.  You might also have to read the code for ClassNgram::prefixProb()
to get the full picture.


More information about the SRILM-User mailing list