question on the tool
Andreas Stolcke
stolcke at speech.sri.com
Thu Aug 1 09:34:37 PDT 2002
In message <20020801150317.36624.qmail at web12503.mail.yahoo.com> you wrote:
>
> Dear Dr. Stolcke,
>
> I have one more question on your
> "replace-words-with-classes" tool, please.
>
> I used the "ngram-class" program to generate a set of
> classes from a broadcast news corpus (223,091 unique
> words), specifying a 36,875-word dictionary as the
> vocabulary. The resulting classes file contains mappings
> for 35,325 words; as far as I can see, the other 187,766
> words (OOVs) have no mappings because they were treated
> as unknown words. That should be no problem. But when I
> used the generated classes to turn the word-based
> transcripts into class-based transcripts, a problem
> occurred: the OOVs could not be mapped to any class
> (since the classes file has no mapping for them), so
> they remain in the text as words. As far as I know, the
> usual way to interpolate a class n-gram with a word
> n-gram is:
>
>   P_hat(w3 | w1, w2) = lambda * Pw(w3 | w1, w2)
>                      + (1 - lambda) * P(w3 | G3) * Pc(G3 | G1, G2)
>
> where wi belongs to class Gi, for i = 1, 2, 3.
>
> So my question is: with transcripts that mix classes
> and words, can we really obtain the correct class-ngram
> probabilities?
>
> Here are the commands I've been using:
>
> 1) ngram-class -debug 0 -text <bn-corpus> -vocab <36K-dict> \
>        -numclasses 1000 -classes <cls>
>
> 2) replace-words-with-classes classes=<cls> <bn-corpus> > <bn-cls-corpus>
You need to add the class definition for the "unknown" word class
yourself. I would recommend that you prevent <unk> from being merged
with any other word class. You can do this by creating a file containing
<s>
</s>
<unk>
and then invoking ngram-class with the -noclass-vocab option and that file
as its argument.
Then you add a new unknown word class to the class definitions from
ngram-class, and put all the remaining words in that class.
(This assumes you actually want your overall LM vocabulary to contain
all 223,091 words. If the word ngram maps those to <unk> then
the class-ngram should do the same, and no modifications to the
class definitions are needed.)
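
As a rough illustration of the "add a new unknown word class" step, here is
a minimal Python sketch. It assumes the class definitions have one line per
word in the form "CLASS PROB WORD" (check this against your own ngram-class
output), that the class name UNKWORDS is free to use (it is a made-up name),
and that membership probabilities can be estimated from relative frequencies
in the corpus. None of this is part of SRILM itself.

```python
from collections import Counter

UNK_CLASS = "UNKWORDS"  # hypothetical class name, not defined by SRILM
SPECIAL_TOKENS = {"<s>", "</s>", "<unk>"}  # never placed in the new class


def read_class_words(class_lines):
    """Collect the words that already have a class.

    Assumes each line looks like: CLASS [PROB] WORD
    (one single-word expansion per line).
    """
    covered = set()
    for line in class_lines:
        fields = line.split()
        if len(fields) >= 2:
            covered.add(fields[-1])
    return covered


def unknown_class_lines(corpus_lines, covered, unk_class=UNK_CLASS):
    """Build class-definition lines that put every uncovered word in one class.

    The membership probability of each word is its count divided by the
    total count of all uncovered words (an assumption of this sketch).
    """
    counts = Counter(
        word
        for line in corpus_lines
        for word in line.split()
        if word not in covered and word not in SPECIAL_TOKENS
    )
    total = sum(counts.values())
    return [
        f"{unk_class} {count / total:.6g} {word}"
        for word, count in sorted(counts.items())
    ]
```

The returned lines would be appended to the class definitions before running
replace-words-with-classes, so that every OOV in the corpus is rewritten to
the single new class.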
>
> I wrote a small Perl script to post-process the
> mixed transcripts, but then I think there could be
> another problem. Too many unknown words will be mapped
> into one single unknown class, which could somehow
> disturb the real probabilities of the class-ngram that
> we should have.
I'm not sure what you mean by "disturb the real probabilities".
But if you want all the words in the class-lm then they have to
get their probability somewhere, and a single class seems like
a reasonable approach. This will smooth their probabilities when
interpolated with the word ngram, which treats all those low-frequency
words as separate. A more sophisticated approach might try
to distinguish the words based on their morphology, but that would require
some significant work.
> Also, I used the command mentioned in your paper to
> expand the class-ngram model I built:
>
> ngram -lm <cls-lm> -prune 1e-5 -expand-exact 3 \
>       -write-lm <exp-lm>
>
> But when I read the expanded model, I can see that there
> are only probabilities for classes (1-, 2-, and 3-grams)
> and no membership distribution, i.e. no P (w3 | G3). How,
> then, can it be interpolated correctly with the
> word-level LM?
First, the ngram -expand function also needs the -classes option
to read in the class definitions. However, I suspect that with
a corpus like BN it will not be feasible to expand the class-ngram
to a word-ngram: there are just too many word ngrams resulting
from such an expansion. Even the pruning won't help you, because
pruning happens AFTER the expansion.
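
To get a feeling for the size of the problem, the following Python sketch
computes a simple upper bound: each class N-gram can expand into as many
word N-grams as the product of its class sizes. The function name and the
class sizes in the example are illustrative assumptions, and the bound
ignores whatever the ngram program does internally.

```python
from math import prod


def expanded_ngram_bound(class_ngrams, class_sizes):
    """Upper bound on the word N-grams produced by expanding class N-grams.

    class_ngrams: iterable of tuples of class names (or plain words).
    class_sizes:  dict mapping class name -> number of member words.
    Tokens without an entry are treated as single words (size 1).
    """
    return sum(
        prod(class_sizes.get(token, 1) for token in ngram)
        for ngram in class_ngrams
    )
```

With 35,325 words in 1000 classes, a class holds about 35 words on average,
so a single class trigram can already stand for up to 35 * 35 * 35 = 42,875
word trigrams. Any trigram that involves a class with 187,766 members is far
worse.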
You don't need to expand a class-ngram to interpolate with a word ngram.
Just use

    ngram -lm WORD-LM -mix-lm CLASS-LM -classes CLASSDEFS \
          -lambda WORD-LM-WEIGHT -bayes 0

followed by other options to compute perplexity etc.
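
For reference, the interpolation formula quoted in the question can be
written out as a small Python sketch. This only illustrates the arithmetic
for a single word; the function and parameter names are made up here, and it
is not a description of how ngram computes the mixture internally.

```python
def interpolated_prob(p_word, p_member, p_class, lam):
    """Linear interpolation of a word N-gram and a class N-gram.

    p_word:   Pw(w3 | w1, w2) from the word N-gram
    p_member: P(w3 | G3), the class membership probability
    p_class:  Pc(G3 | G1, G2) from the class N-gram
    lam:      weight of the word N-gram (the -lambda value)
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    return lam * p_word + (1.0 - lam) * p_member * p_class
```

For example, with Pw = 0.01, P(w3 | G3) = 0.2, Pc = 0.1 and lambda = 0.5,
the result is 0.5 * 0.01 + 0.5 * 0.2 * 0.1 = 0.015.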
> If there is a news board for users of the SRI Toolkit,
> I could turn to the community for help instead of
> taking too much of your time. Many thanks!
Indeed there is a mailing list for SRILM users.
To join, mail the line "subscribe srilm-user" (in the message body)
to majordomo at speech.sri.com.
Regards,
--Andreas