Saving option for ngram-class

Andreas Stolcke stolcke at speech.sri.com
Thu Nov 1 11:33:09 PDT 2007


In message <159333.1460.qm at web31611.mail.mud.yahoo.com> you wrote:
> Hi,
> I guess the -save option as implemented in ngram-class is not very useful.

I agree.

> Typically, I'm not interested in testing the classes that appear at the
> beginning of the clustering process, but rather in the classes induced in
> the final steps. If the number of clustered words is high, the current
> option results in creating an enormous number of useless files.
> 
> It'd be much more practical if the user could explicitly set which class
> granularities should be saved or, alternatively, to have some -startsave
> option that would start saving class files close to the end of the
> clustering.
> 
> Would that be easy to implement?

The next release (due out soon) will have a new option:

       -save-maxclasses K
              Modifies  the  action  of -save so as to only start
              saving once the number of classes reaches K.   (The
              iteration  numbers embedded in filenames will start
              at 0 from that point.)

> 
> One more thing: is there an easy way to find out how many classes appear in
> a particular class file without writing a script? The number of iterations
> doesn't say that directly, and I'm not sure whether it can be computed as
> NUMBER_OF_WORDS_IN_THE_VOCAB - NUMBER_OF_ITERATIONS -
> NUMBER_OF_WORDS_IN_THE_NO_CLASS_VOCAB

You can get the number of classes from the class definition file
(CLASSFILE below stands for its name) with

gawk '{ print $1 }' CLASSFILE | sort -u | wc -l

This shouldn't be needed when using the -save-maxclasses option, since you
specify the number of classes directly (and then each new saved file has
S fewer classes, where S is the argument to -save).
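
If you want to check several saved files, the counting command can be wrapped
in a small shell function. This is only a sketch: it assumes that every
non-empty line of the class definition file starts with the class label as
its first field, and the function name count_classes is made up for
illustration. It counts distinct labels directly, so it does not depend on
lines of the same class being adjacent.

```shell
# count_classes FILE
# Print the number of distinct class labels (first field) in FILE.
# Assumption: each non-empty line begins with the class label.
count_classes() {
    awk 'NF > 0 && !seen[$1]++ { n++ } END { print n + 0 }' "$1"
}

# Usage (hypothetical file name):
#   count_classes my.classes
```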

Andreas 



