ngram-class is too time consuming
Andreas Stolcke
stolcke at speech.sri.com
Tue Oct 28 11:24:45 PDT 2008
??? wrote:
> I want to use a class-based bigram model, like this:
> P(w2 | w1) = lambda * Pw(w2 | w1) + (1 - lambda) * P(w2 | G2) * Pc(G2 | G1)
> where wi belongs to class Gi, i=1, 2, respectively.
> So I used the "ngram-class" program to generate a set of classes from
> a corpus (282,360 unique words), with the number of output classes set
> to 2,000. But the program takes too long to run, maybe 10 days. My
> computer is a Core 2 at 1.8 GHz.
> Here is my command:
> ngram-class -text <word-corpus> -numclasses 2000 -classes <cls> -incremental
>
> Is there a problem with this, or is it normal?
It's probably normal. 282k is quite a large vocabulary. You might want
to play with different vocabulary sizes, especially excluding words with
very low counts (such as singletons), because their statistics are not
reliable and they won't be clustered properly. It might be best to group
all those words in a special class ahead of time.
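As a rough illustration of grouping the low-count words ahead of time, here is a minimal Python sketch. It is not part of SRILM: the placeholder token "<rare>", the min_count threshold, and the function names are all made up for this example. It simply rewrites the corpus so that every word seen fewer than min_count times becomes the same token, which shrinks the vocabulary that the clustering has to handle.

```python
from collections import Counter

# Hypothetical placeholder token; not an SRILM convention.
RARE_TOKEN = "<rare>"


def count_words(sentences):
    """Count whitespace-separated tokens over all sentences."""
    counts = Counter()
    for sentence in sentences:
        counts.update(sentence.split())
    return counts


def map_rare_words(sentences, min_count=2, rare_token=RARE_TOKEN):
    """Replace every word seen fewer than min_count times with rare_token.

    Returns the rewritten sentences and the original word counts.
    """
    sentences = list(sentences)  # the corpus is traversed twice
    counts = count_words(sentences)
    mapped = [
        " ".join(w if counts[w] >= min_count else rare_token
                 for w in sentence.split())
        for sentence in sentences
    ]
    return mapped, counts


# Toy corpus: "dog", "a" and "ran" are singletons.
corpus = ["the cat sat", "the dog sat", "a cat ran"]
mapped, counts = map_rare_words(corpus, min_count=2)
vocab_after = {w for sentence in mapped for w in sentence.split()}
```

The rewritten corpus could then be given to ngram-class with -text in place of the original, so that the rare words behave as one unit during clustering. Whether a threshold of 2 is the right cutoff depends on the data.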
For comparison, running the small test in $SRILM/test,
    make TEST=class-ngram
should take about 0.15 seconds of CPU time on a 2.6GHz Opteron machine.
Andreas