[SRILM User List] class-based LM

denisdeis denis_2046 at hotmail.com
Wed Feb 24 09:56:22 PST 2010

Dear Dr Andreas Stolcke,

I want to train a class-based model for a dataset, d1. I have a dictionary
of 65k words, built from several different datasets, and d1 covers only 30k
of those words. I have also trained a word n-gram model on d1 and want to
interpolate it with the class-based model. I used ngram-class to obtain the
class definitions and replace-words-with-classes to replace the words in the
text with their classes. The next step should be to train the LM with
ngram-count. The command I used is below:

-tolower -vocab dict -text text_withclassreplace -kndiscount2 -gt1min 1 -gt2min 1 -gt3min 2 -lm class_based_model

I have two questions about the command above:

1) What should dict contain? If I am correct, it should list class1, class2,
class3, etc., because classes are treated as words here. With 50 classes, it
should contain class1 through class50. However, as I said above, d1 covers only
30k words, so 35k dictionary words never appear in d1. How should I deal with
them? (In the word n-gram model, unigrams for these words are computed and
added to the model.) Can I put them in a special class and add that class to
the class definitions obtained from ngram-class? If so, I need to assign a
probability to each word in the special class, but how? Another option is to
add the 35k words to dict after the class1-class50 entries. But when I did
that, the unigrams for the 35k words in the resulting LM were all -99. Is
that reasonable?
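
To make the "special class" idea in the first question concrete, here is a minimal Python sketch of what I have in mind. It is not SRILM code. It assumes the class definitions have one "CLASS PROB WORD" entry per line, and it assumes a uniform membership probability of 1/N for the N uncovered words, which is only one possible choice. The function name add_unseen_class and the class name CLASS-UNSEEN are hypothetical.

```python
def add_unseen_class(class_lines, dictionary, class_name="CLASS-UNSEEN"):
    """Append uniform-probability entries for dictionary words that no class covers.

    Assumes each class definition line has the layout: CLASS PROB WORD.
    """
    covered = set()
    for line in class_lines:
        fields = line.split()
        if len(fields) >= 3:
            covered.add(fields[2])  # third field is the member word

    # Dictionary words that never appeared in the training text.
    unseen = sorted(w for w in dictionary if w not in covered)
    if not unseen:
        return list(class_lines)

    # Assumption: every unseen word gets the same membership probability.
    p = 1.0 / len(unseen)
    extra = [f"{class_name} {p:.6g} {w}" for w in unseen]
    return list(class_lines) + extra


# Toy data standing in for the real class definitions and the 65k dictionary.
classes = ["CLASS-00001 0.75 the", "CLASS-00001 0.25 a", "CLASS-00002 1 cat"]
dictionary = ["the", "a", "cat", "dog", "emu"]

for line in add_unseen_class(classes, dictionary):
    print(line)
```

This only supplies the word-given-class probabilities. The special class token itself still never occurs in the class-replaced text, so the class-level probability remains an open problem.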

2) When I train the class-based LM, ngram-count always gives the following
warning:

warning : no singleton counts

GT discounting disabled

I actually used the option "-kndiscount2", so I don't know why it says
"GT discounting disabled". Also, what does "no singleton counts" mean, and
does it matter? Even though I get this message, ngram-count still outputs a
class-based LM.
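
My guess is that "singleton counts" refers to n-grams that occur exactly once in the training text. That is an assumption on my part, not something I have confirmed. If it is right, then with only 50 class tokens it seems plausible that no unigram is a singleton, as this small Python sketch illustrates. It is not SRILM code, and count_of_counts is a hypothetical helper name.

```python
from collections import Counter


def count_of_counts(sentences, order=1):
    """Map each frequency r to the number of distinct n-grams seen exactly r times."""
    ngrams = Counter()
    for sent in sentences:
        toks = sent.split()
        for i in range(len(toks) - order + 1):
            ngrams[tuple(toks[i:i + order])] += 1
    return Counter(ngrams.values())


# Toy class-replaced text: few distinct tokens, each repeated many times.
text = ["C1 C2 C1 C2", "C2 C1 C2 C1"]

coc = count_of_counts(text, order=1)
# Number of unigrams that occur exactly once (the "singletons").
print(coc.get(1, 0))
```

In this toy text both class tokens occur four times, so the number of singleton unigrams is zero.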

Could anyone help? I searched the earlier questions on this list but did not
find an answer. Thanks for your help.

