ngram-class with -incremental + -save-maxclasses

Andreas Stolcke stolcke at speech.sri.com
Thu Apr 3 22:05:17 PDT 2008


Matt Lease wrote:
> What is the behavior of -save-maxclasses for ngram-class when 
> -incremental is used?  My understanding of -incremental is that C as 
> specified by -numclasses determines the number of classes for the 
> entire run-time (i.e. C+1 for the new word being merged into the 
> existing C classes), in which case -save-maxclasses would seem not to 
> add anything (i.e. perhaps it's only intended for V^3 clustering).
Incremental merging works differently.  It first makes one class per 
word (typically giving a number >> C), then merges the classes starting 
at C+1 into the first C until only C classes are left.
So the total number of classes does shrink over the course of the run, 
and the -save-maxclasses option has the intended effect.
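
The merging order described above can be sketched in a few lines of
Python.  This is only a toy illustration, not the ngram-class code: the
function names are made up, the merge criterion is a brute-force
class-bigram log-likelihood, and the save_maxclasses argument assumes
that the option means "keep a snapshot once the total number of classes
has dropped to this value or below".

```python
import math
from collections import Counter
from itertools import combinations


def class_bigram_score(bigrams, word2class):
    """Class-bigram log-likelihood, up to a constant (toy merge criterion)."""
    pair, left, right = Counter(), Counter(), Counter()
    for (w1, w2), n in bigrams.items():
        c1, c2 = word2class[w1], word2class[w2]
        pair[(c1, c2)] += n
        left[c1] += n
        right[c2] += n
    return sum(n * math.log(n / (left[c1] * right[c2]))
               for (c1, c2), n in pair.items())


def incremental_merge(tokens, num_classes, save_maxclasses=0):
    """Hypothetical sketch: one class per word, then fold class C+1 into the first C."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    words = [w for w, _ in unigrams.most_common()]  # most frequent first
    word2class = {w: w for w in words}              # start: one class per word
    active = words[:num_classes]                    # the first C classes
    total = len(words)                              # total classes right now
    snapshots = {}
    for new in words[num_classes:]:
        active.append(new)                          # C+1 classes in the window
        best = None
        for a, b in combinations(active, 2):
            # Try merging class b into class a and score the result.
            trial = {w: (a if c == b else c) for w, c in word2class.items()}
            score = class_bigram_score(bigrams, trial)
            if best is None or score > best[0]:
                best = (score, b, trial)
        _, merged_away, word2class = best
        active.remove(merged_away)                  # back to C classes in the window
        total -= 1
        # Assumed meaning of -save-maxclasses: snapshot only at or below the limit.
        if total <= save_maxclasses:
            snapshots[total] = dict(word2class)
    return word2class, snapshots
```

Calling incremental_merge(tokens, 3, save_maxclasses=5) on a corpus with
seven distinct words leaves three classes and keeps the intermediate
mappings for 5, 4 and 3 classes, which is why the option is meaningful
even with -incremental.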
>
> If one wanted to get different clusterings with the greedy algorithm 
> without re-running each from scratch, it looks like you can use the 
> -class-counts option and then feed this counts file into a subsequent 
> invocation of ngram-class.  For example, run it initially with C=1000, 
> then feed the output class counts into a second invocation with C=500, 
> say.  Is this the correct procedure?
It will work in principle, except that the second run will have no 
access to the original word vocabulary, so the class definitions it 
produces will be in terms of the class vocabulary produced by the first 
run.  Also (I haven't checked this), there might be name collisions 
since the "words" and "classes" use the same names.
What is really needed (but not implemented so far) is a mechanism for 
reading the saved classes and counts from a prior run into ngram-class 
and continuing the merging from there.
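
Until such a mechanism exists, the class definitions of a two-stage run
could be mapped back to the original words outside of ngram-class.  The
following is a rough sketch, not part of SRILM: the helper names are
hypothetical, and it assumes each line of a classes file has the form
"CLASS PROB WORD".  Keeping the two runs in separate tables also avoids
confusing first-run and second-run class names that happen to coincide.

```python
def parse_classes(text):
    """Parse lines of the assumed form 'CLASS PROB WORD' into {word: (class, prob)}."""
    members = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 3:
            continue  # skip blank or unexpected lines
        cls, prob, word = fields
        members[word] = (cls, float(prob))
    return members


def compose(first, second):
    """Express second-run classes in terms of the original words.

    first:  word -> (first-run class, p(word | first-run class))
    second: first-run class -> (second-run class, p(first-run class | second-run class))
    """
    composed = {}
    for word, (cls1, p1) in first.items():
        if cls1 in second:
            cls2, p2 = second[cls1]
            # p(word | cls2) = p(cls1 | cls2) * p(word | cls1)
            composed[word] = (cls2, p1 * p2)
        else:
            # First-run class not seen by the second run: keep it unchanged.
            composed[word] = (cls1, p1)
    return composed
```

The membership probabilities are multiplied because a word is generated
from a second-run class by first choosing the first-run class and then
the word within it.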

Andreas




