deliverable at gmail.com
Thu Nov 1 14:42:55 PDT 2007
A separate task I run on a corpus is computing a "running ngram
count": at each "tick" through the corpus, e.g. after 10%, 20%, etc.
of it, or after every N files, or after every file, show the
*increase* in the number of distinct ngrams.
Obviously, building file lists that each add a single file and
rerunning ngram-count on every such list is inefficient. Do I have to
go into the C++ library for this, and if so, which classes should
I use? Basically, I want to know which *new* ngrams each file
contributes, in processing order. I may also want to output
them separately for inspection.
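
To make the bookkeeping concrete, here is a minimal sketch in plain Python, not SRILM: it keeps one set of ngrams seen so far and, for each file in order, reports the ngrams that file adds. The names `ngrams` and `running_new_ngrams`, the whitespace tokenization, and the `max_order` parameter are all assumptions for illustration; it does not try to reproduce ngram-count's own handling of sentence boundaries or vocabulary.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Ngram = Tuple[str, ...]


def ngrams(tokens: List[str], max_order: int) -> Iterator[Ngram]:
    """Yield every ngram of order 1..max_order from one tokenized line."""
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            yield tuple(tokens[i:i + n])


def running_new_ngrams(
    docs: Iterable[Tuple[str, Iterable[str]]], max_order: int = 3
) -> Iterator[Tuple[str, Set[Ngram], int]]:
    """For each (name, lines) pair, in order, yield the ngrams not seen
    in any earlier document, plus the running total of distinct ngrams."""
    seen: Set[Ngram] = set()
    for name, lines in docs:
        new: Set[Ngram] = set()
        for line in lines:
            # Hypothetical tokenization: split on whitespace only.
            for gram in ngrams(line.split(), max_order):
                if gram not in seen:
                    seen.add(gram)
                    new.add(gram)
        yield name, new, len(seen)


if __name__ == "__main__":
    # Toy in-memory "files"; real use would pass (path, open(path)) pairs.
    corpus = [("a.txt", ["the cat sat"]), ("b.txt", ["the cat ran"])]
    for name, new, total in running_new_ngrams(corpus, max_order=2):
        print(name, len(new), total)
        for gram in sorted(new):
            print("   ", " ".join(gram))
```

Each file is read once, so the cost is a single pass over the corpus plus memory for the set of distinct ngrams. Ticks at every N files or every 10% would just accumulate the per-file results before printing.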