incremental ngram-count

Alexy Khrabrov deliverable at
Thu Nov 1 14:42:55 PDT 2007

A separate task I do on a corpus is computing a "running ngram
count": at each "tick" through the corpus (e.g. every 10% of it,
every N files, or every single file), show the *increase* in the
number of distinct ngrams seen so far.

Obviously, building sublists of files with a single file added each
time and rerunning ngram-count on every such list is inefficient.  Do
I have to go into the C++ library for this, and if so, which classes
should I use?  Basically, I want to know which *new* ngrams are
contributed by a given file, in the order the files are processed.  I
may also want to output them separately for inspection.
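
To make the goal concrete, here is a minimal sketch in plain Python of
what I mean by "new ngrams per file".  It does not use SRILM at all:
the function names (ngrams_in_line, running_new_ngrams), whitespace
tokenization, and the absence of sentence-boundary tags are all
assumptions made for illustration, so the numbers need not match what
ngram-count itself would report.

```python
from typing import Iterable, Iterator, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams_in_line(line: str, max_order: int) -> Iterator[Ngram]:
    """Yield every ngram of order 1..max_order from one line.

    Assumption: tokens are separated by whitespace and no <s> / </s>
    boundary markers are added.
    """
    words = line.split()
    for order in range(1, max_order + 1):
        for start in range(len(words) - order + 1):
            yield tuple(words[start:start + order])


def running_new_ngrams(
    documents: Iterable[Iterable[str]], max_order: int = 3
) -> Iterator[Tuple[Set[Ngram], int]]:
    """For each document (an iterable of lines), in order, yield
    (ngrams first seen in this document, total distinct ngrams so far).
    """
    seen: Set[Ngram] = set()
    for lines in documents:
        new: Set[Ngram] = set()
        for line in lines:
            for ngram in ngrams_in_line(line, max_order):
                if ngram not in seen:
                    new.add(ngram)
        # Merge only after the whole document is scanned, so "new"
        # holds everything this document contributed.
        seen |= new
        yield new, len(seen)


if __name__ == "__main__":
    # Two tiny in-memory "files"; real use would pass open file objects.
    corpus = [["a b c"], ["a b d"]]
    for index, (new, total) in enumerate(running_new_ngrams(corpus, 2)):
        print(index, len(new), total, sorted(new))
```

Each file is read exactly once, and the only state carried between
files is the set of ngrams seen so far, so the cost is memory for that
set rather than repeated passes over the corpus.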

