[SRILM User List] class-based LM
denisdeis
denis_2046 at hotmail.com
Wed Feb 24 09:56:22 PST 2010
Dear Dr Andreas Stolcke,
I want to train a class-based model for a dataset d1. I have a dictionary
with 65k words. The dictionary was built from several different datasets, and
d1 covers only 30k of its words. I also trained a word n-gram
model on d1 and want to interpolate it with the class-based model. I used
ngram-class to get the class definitions and replace-words-with-classes to
replace words with classes in the text. The next step should be to use
ngram-count to train the LM. The command I used is below:
ngram-count -tolower -vocab dict -text text_withclassreplace \
    -kndiscount2 -gt1min 1 -gt2min 1 -gt3min 2 -lm class_based_model
I have two questions about the command above:
1) What should the dict include? If I am correct, it should be something like
class1, class2, class3, etc., because we treat classes as words here. If I have
50 classes, it should include class1-class50. However, as I said above, d1 only
covers 30k words, so there are 35k words that never appear in d1. How can I
deal with them? (For the word n-gram model, their unigrams are computed and
added to the model.) Can I put them in a special class and add that class
to the class definitions obtained from ngram-class? If I do so, I need to
calculate the probability of each word in the special class. But how?
Another solution is to add the 35k words to the dict after the
list class1-class50. But after I got the LM, I found that the unigrams for the
35k words are all -99. Is that reasonable?
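
To make the coverage gap in question 1 concrete, here is a minimal Python sketch (not part of SRILM) that lists the dictionary words never seen in the training text. The function name uncovered_words and the toy data are assumptions for illustration only; the lowercasing is meant to mirror the -tolower option in the command above.

```python
def uncovered_words(vocab, sentences, lowercase=True):
    """Return the vocabulary words that never occur in the given sentences."""
    def norm(word):
        # Mirror -tolower: compare words case-insensitively when requested.
        return word.lower() if lowercase else word

    seen = set()
    for line in sentences:
        for token in line.split():
            seen.add(norm(token))
    return sorted({norm(w) for w in vocab} - seen)


if __name__ == "__main__":
    # Toy stand-ins for the 65k-word dictionary and the d1 training text.
    vocab = ["the", "cat", "sat", "dog", "ran"]
    text = ["The cat sat", "the cat"]
    print(uncovered_words(vocab, text))
```

With real data, the two arguments would be the word list read from the dict file and the lines of the original (not class-replaced) training text.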
2) When I train the class-based LM, ngram-count always gives a warning message
as below:
warning : no singleton counts
GT discounting disabled
Actually, I used the option "-kndiscount2" here. I don't know why it says
"GT discounting disabled". Besides, what does "no singleton
counts" mean? Does it matter? Even though I got this message, I could
still get a class-based LM as output.
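
For context on the warning in question 2: Good-Turing discounting is estimated from count-of-counts, and a "singleton" is an n-gram that occurs exactly once. The sketch below is plain Python, not SRILM code; count_of_counts and the toy class-replaced text are hypothetical, and sentence-boundary tokens are ignored for simplicity. It shows how a text made of only a few class tokens can end up with no unigrams of count 1.

```python
from collections import Counter


def count_of_counts(sentences, order=1):
    """Map each count r to the number of distinct n-grams seen exactly r times."""
    ngrams = Counter()
    for line in sentences:
        tokens = line.split()
        # Slide a window of length `order` over the sentence.
        for i in range(len(tokens) - order + 1):
            ngrams[tuple(tokens[i:i + order])] += 1
    return Counter(ngrams.values())


if __name__ == "__main__":
    # Toy class-replaced text: every class token occurs several times.
    text = ["class1 class2 class1", "class2 class1 class2"]
    coc = count_of_counts(text, order=1)
    # coc[1] is the number of singleton unigrams; a Counter yields 0 if absent.
    print("singleton unigrams:", coc[1])
```

Calling the same function with order=2 or order=3 gives the corresponding figures for bigrams and trigrams.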
Could anyone help me? I checked all the questions on this list but
haven't found answers. Thanks for your help.
Denis