question on the tool

Thu Aug 1 09:34:37 PDT 2002

In message <20020801150317.36624.qmail at web12503.mail.yahoo.com> you wrote:
> 
> Dear Dr. Stolcke,
> 
> I have one more question on your
> "replace-words-with-classes" tool, please. 
> 
> I used the "ngram-class" program to generate a set of
> classes using some broadcast news corpus 
> (223,091 unique words) and specifying the vocab to be 
> a 36875 words dict. And the output of the classes
> contains the mapping of 35325 words, as I can see,
> 187,766 OOVs have no mappings for them, since they've
> been treated as the unknown words. This should be no
> problem. But when using the generated classes to
> replace the word-based trans to be class-based trans,
> a problem occurred. The OOVs could not be mapped into any
> classes (since there is no mapping for such words in
> the classes file), thus they remain there! But to my
> knowledge, if we want to learn the class-ngram, a
> usual form for interpolating it with the word-ngram
> is:
> 
> \hat{P}(w_3 | w_1, w_2) = \lambda P_w(w_3 | w_1, w_2)
>     + (1 - \lambda) P(w_3 | G_3) P_c(G_3 | G_1, G_2)
> where w_i belongs to class G_i, for i = 1, 2, 3.
> 
> So my question is, with the classes/words mixed trans,
> can we really obtain the correct class-ngram
> probabilities? 
> 
> Here are the commands I've been using:
> 
> 1) ngram-class -debug 0 -text <bn-corpus> -vocab
> <36K-dict> -numclasses 1000 -classes <cls>
> 
> 2) replace-words-with-classes classes=<cls>
> <bn-corpus> > <bn-cls-corpus>

You need to add the class definition for the "unknown" word class
yourself.  I would recommend that you prevent <unk> from being merged
with any other word class.  You can do this by creating a file containing

	<s>
	</s>
	<unk>

and then invoking ngram-class with -noclass-vocab and that file as its argument.
Then you add a new unknown word class to the class definitions from
ngram-class, and put all the remaining words in that class.
(This assumes you actually want your overall LM vocabulary to contain
all 223,091 words.  If the word ngram maps those to <unk> then 
the class-ngram should do the same, and no modifications to the 
class definitions are needed.)
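
As a rough illustration of the extra class definition, the sketch below
collects every out-of-vocabulary word into one class and prints lines in
the "class prob word" layout. Several things here are assumptions, not part
of the toolkit: the class name UNKCLASS is made up, the membership
probabilities are simple relative frequencies among the OOV tokens, and the
line layout is my understanding of the class definition format, so check it
against the output that ngram-class actually wrote before appending.

```python
from collections import Counter


def unknown_class_lines(tokens, vocab, class_name="UNKCLASS"):
    """Build class-definition lines that put every OOV word in one class.

    class_name is a hypothetical label; the probability on each line is the
    word's relative frequency among all OOV tokens (an assumed estimate).
    """
    oov_counts = Counter(t for t in tokens if t not in vocab)
    total = sum(oov_counts.values())
    lines = []
    for word, count in sorted(oov_counts.items()):
        lines.append(f"{class_name} {count / total:.6g} {word}")
    return lines


# Toy corpus: "zebra" and "quokka" are outside the vocabulary.
tokens = "the cat saw a zebra and a quokka and a zebra".split()
vocab = {"the", "cat", "saw", "a", "and"}
for line in unknown_class_lines(tokens, vocab):
    print(line)
```

The printed lines would be appended to the class file, so that every word
in the corpus has a class mapping.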

> 
> I did a small perl script to post-process the
> mixed-trans, but then I think there could be another
> problem. Too many unknown words will be mapped into 
> one single unknown class, which somehow, could disturb
> the real probabilities of the class-ngram that we
> should have. 

I'm not sure what you mean by "disturb the real probabilities".
But if you want all the words in the class-lm then they have to
get their probability somewhere, and a single class seems like
a reasonable approach.  This will smooth their probabilities when
interpolated with the word ngram, which treats all those low-frequency
words as separate.  A more sophisticated approach would maybe try 
to distinguish the words based on their morphology, but that would require
some significant work.

> Also, I used the command mentioned in your paper to
> expand the built-up class-ngram model: 
> 
> ngram -lm <cls-lm> -prune 1e-5 -expand-exact 3
> -write-lm <exp-lm>
> 
> but as I read the expanded model, I can see there
> are only probabilities for class (1,2,3-grams), but
> no membership distribution, i.e. no P (w3 | G3). Then
> how can it be interpolated with the word-level LM
> correctly?

First, the ngram -expand function also needs the -classes option
to read in the class definitions.  However, I suspect that with
a corpus like BN it will not be feasible to expand the class-ngram
to a word-ngram; there are just too many word ngrams resulting 
from such an expansion.  Even the pruning won't help you because 
pruning happens AFTER the expansion.
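
To get a feel for the blow-up, here is a back-of-the-envelope sketch. It
assumes that every combination of class members is a distinct word ngram,
so it is only an upper bound for a single class ngram, and the class size
of 35 is just the rough average you get from about 35,000 words in 1000
classes.

```python
from math import prod


def expansion_upper_bound(class_sizes):
    """Upper bound on word ngrams produced by expanding one class ngram.

    class_sizes holds the number of member words of each class in the
    ngram; the bound assumes every member combination is distinct.
    """
    return prod(class_sizes)


# One class trigram whose three classes each have about 35 members.
bound = expansion_upper_bound([35, 35, 35])
print(bound)
```

That bound applies per class trigram, and the class LM contains many of
them, which is why the expanded model gets out of hand before pruning can
shrink it.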

You don't need to expand a class-ngram to interpolate with a word ngram.
Just use

	ngram -lm WORD-LM -mix-lm CLASS-LM -classes CLASSDEFS \
		-lambda WORD-LM-WEIGHT -bayes 0

followed by other options to compute perplexity etc.
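
For reference, the quantity being computed is the interpolation from your
formula, with lambda playing the role of WORD-LM-WEIGHT. The sketch below
just evaluates that formula for one word; the function name and the numbers
are made up for illustration, and it is not how ngram is implemented.

```python
def interpolate(p_word, p_member, p_class, lam):
    """Linear interpolation of a word ngram and a class ngram.

    p_word   : Pw(w3 | w1, w2) from the word LM
    p_member : P(w3 | G3), the class membership probability
    p_class  : Pc(G3 | G1, G2) from the class LM
    lam      : weight given to the word LM, between 0 and 1
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    return lam * p_word + (1.0 - lam) * p_member * p_class


# Hypothetical probabilities, equal weight on both models.
p = interpolate(p_word=0.002, p_member=0.05, p_class=0.1, lam=0.5)
print(p)
```

Note that the membership probability P(w3 | G3) comes from the class
definitions file, which is why -classes has to be given on the command
line.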

> if there is any news board on using SRI Toolkit, then
> I could turn to the community for help, instead of
> taking too much of your time. Many thanks!

Indeed there is a mailing list for SRILM users.
To join, mail the line "subscribe srilm-user" (in the message body)
to majordomo at speech.sri.com.

Regards,

--Andreas