[SRILM User List] class-based LM
Andreas Stolcke
stolcke at speech.sri.com
Thu Feb 25 21:38:37 PST 2010
On 2/24/2010 9:56 AM, denisdeis wrote:
>
> Dear Dr. Andreas Stolcke,
>
> I want to train a class-based model for a dataset d1. I have a
> dictionary of 65k words. The dictionary was built from several
> different datasets, and d1 covers only 30k of its words.
> I also trained a word n-gram model on d1 and want to interpolate it
> with the class-based model. I used ngram-class to get the class
> definitions and replace-words-with-classes to replace words with
> classes in the text. The next step should be to use ngram-count to
> train the LM. The command I used is below:
>
> ngram-count -tolower -vocab dict -text text_withclassreplace
> -kndiscount2 -gt1min 1 -gt2min 1 -gt3min 2 -lm class_based_model
>
> I have two questions about the command above:
>
> 1) What should the dict include? If I am correct, it should be
> something like class1, class2, class3, etc., because we treat classes
> as words here. If I have 50 classes, it should include class1-class50.
> However, as I said above, d1 covers only 30k words, so there are 35k
> words that never appear in d1. How can I deal with them? (For the
> word n-gram model, their unigrams are computed and added to the model.)
> Can I put them in a special class and add that class to the class
> definitions obtained from ngram-class? If I do so, I need to calculate
> the probability of each word in the special class. But how?
> Another solution is to add the 35k words to the dict
> after the list class1-class50. But after I got the LM, I found that the
> unigrams for the 35k words are all -99. Is that reasonable?
>
The vocabulary should be the union of all classes, as well as all words
that weren't replaced by classes (that makes up all the "events" that can
occur in the class LM).
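
As a rough illustration of that union, here is a minimal Python sketch. It
is not part of SRILM; the function name, the class labels and the sample
text are all hypothetical. It assumes the text has already been run through
replace-words-with-classes, so any token that is not a class label is a
word that was left unreplaced.

```python
def class_lm_vocabulary(class_names, replaced_text_lines):
    """Return class labels plus every token left in the class-replaced text."""
    vocab = set(class_names)          # every class is an "event"
    for line in replaced_text_lines:  # so is every word that was not replaced
        vocab.update(line.split())
    return vocab


# Hypothetical class labels and class-replaced training text.
classes = ["CLASS1", "CLASS2"]
text = ["CLASS1 the CLASS2", "the CLASS1 of CLASS2"]

vocab = class_lm_vocabulary(classes, text)
print(sorted(vocab))  # ['CLASS1', 'CLASS2', 'of', 'the']
```

Writing the result one token per line would give a file that could be
passed as the -vocab argument.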
>
>
> 2) When I train the class-based LM, ngram-count always gives a warning
> message as below:
>
> warning : no singleton counts
> GT discounting disabled
>
> Actually, here I used the option "-kndiscount2". I don't know why it
> said "GT discounting disabled". Besides, what does "no singleton counts"
> mean? Does it matter? Even though I got this kind of message, I could
> still get a class-based LM output.
>
Both GT and KN smoothing methods rely on having the number of words that
occur only once in the training corpus. It is very common for class LMs
to have no singletons in their training data (because of the class
replacement). You should choose another smoothing method. -wbdiscount
should work fine.
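
To see why class replacement can remove the singletons, the following
Python sketch tallies counts-of-counts on a toy corpus before and after
replacement. It only looks at unigrams and is not how ngram-count is
implemented; the function name, the sample sentences and the class labels
C1-C3 are made up for illustration.

```python
from collections import Counter


def count_of_counts(lines):
    """Map each frequency r to the number of distinct tokens seen r times."""
    token_counts = Counter(tok for line in lines for tok in line.split())
    return Counter(token_counts.values())


# Toy corpus: word text, then the same text with every word replaced by a
# hypothetical class label (C1 = the, C2 = cat/dog, C3 = sat/ran).
word_text = ["the cat sat", "the dog ran"]
class_text = ["C1 C2 C3", "C1 C2 C3"]

word_coc = count_of_counts(word_text)
class_coc = count_of_counts(class_text)

print(word_coc[1])   # 4 tokens occur exactly once in the word text
print(class_coc[1])  # 0 singletons remain after class replacement
```

With no singletons, discounting methods that are estimated from the
singleton count have nothing to work from, which is the situation the
warning describes.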
Andreas
>
>
> Could anyone give me some help? I checked all the questions on this
> list but haven't found answers. Thanks for your help.
>
> Denis
>
>