Saving option for ngram-class
Andreas Stolcke
stolcke at speech.sri.com
Thu Nov 1 11:33:09 PDT 2007
In message <159333.1460.qm at web31611.mail.mud.yahoo.com> you wrote:
> Hi,
> I guess the -save option as implemented in ngram-class is not very useful.
I agree.
> Typically, I'm not interested in testing classes as they appear at the
> beginning of the clustering process, but rather in classes induced in the
> final steps. If the number of clustered words is high, the current option
> results in creating an enormous number of useless files.
>
> It'd be much more practical if the user could explicitly set which classes
> with different granularity should be saved, or, alternatively, to have some
> -startsave option which'd allow to start saving class files close to the end
> of the clustering.
>
> Would that be easy to implement?
The next release (due out soon) will have a new option:

    -save-maxclasses K
        Modifies the action of -save so as to only start
        saving once the number of classes reaches K. (The
        iteration numbers embedded in filenames will start
        at 0 from that point.)

>
> One more thing, is there an easy way to find how many classes appear in a
> particular class file without writing a script? The number of iterations
> doesn't say that directly and I'm not sure whether it can be computed as
> NUMBER_OF_WORDS_IN_THE_VOCAB - NUMBER_OF_ITERATIONS -
> NUMBER_OF_WORDS_IN_THE_NO_CLASS_VOCAB
You can get the number of classes from the class definition file with
    gawk '{ print $1 }' CLASSFILE | sort -u | wc -l
This shouldn't be needed when using the -save-maxclasses option since you
specify the number of classes directly (and then each new saved file has
S fewer classes, where S is the argument to -save).
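
If you do want to check the counts across several saved files, the same idea
can be wrapped in a small shell function. This is only a sketch: it assumes
the class label is the first whitespace-separated field on each line of the
class definition file, and the name count_classes is made up for illustration
(it is not part of SRILM). It uses plain awk and sort -u so that class lines
need not be adjacent in the file.

```shell
#!/bin/sh
# count_classes FILE
# Hypothetical helper (not part of SRILM): prints the number of distinct
# class labels in a class definition file, assuming the label is the
# first whitespace-separated field on each non-empty line.
count_classes() {
    awk 'NF > 0 { print $1 }' "$1" | sort -u | wc -l | tr -d ' '
}

# Report the class count for every file named on the command line.
for f in "$@"; do
    printf '%s\t%s\n' "$f" "$(count_classes "$f")"
done
```

Run it with the saved class files as arguments; it prints one line per file
with the file name and the number of classes found in it.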
Andreas