[SRILM User List] vocab size from make-batch-counts

Sun Sep 20 12:03:55 PDT 2009

Hello,

I am wondering what type of n-gram pruning is done in the 
make-batch-counts training script, and whether it can be controlled with 
flags.  I've looked through the code and man pages, but I can't tell 
whether there is an argument that controls it.  I noticed the pruning 
because the resulting vocabulary size changes when I vary the batch size.  
For instance, on a small development corpus:

> make-batch-counts files.list 10 xmlfilter.sh counts_10perbatch
> merge-batch-counts counts_10perbatch
> ngram-count -read counts_10perbatch/files.list-1.ngrams.gz -write-vocab 10perbatch.vocab
> wc 10perbatch.vocab
   2763  2763 32999 10perbatch.vocab

> make-batch-counts files.list 1 xmlfilter.sh counts_1perbatch
> merge-batch-counts counts_1perbatch
> ngram-count -read counts_1perbatch/merge-iter2-1.ngrams.gz -write-vocab 1perbatch.vocab
> wc 1perbatch.vocab
   5923  5923 72237 1perbatch.vocab

I get the same sort of result with a larger corpus or other batch sizes: 
the vocabulary shrinks as the batch size increases.  I have tried 
experimenting with -gtmin to change the output, without success.  I'm 
confused as to why batch size would make a difference here.
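
A minimal sketch for checking which words are missing from the smaller 
vocabulary, assuming the two vocabulary files written above hold one word 
per line.  The helper name vocab_diff is hypothetical and not part of 
SRILM; it relies only on the standard sort and comm utilities.

```shell
# Hypothetical helper (not an SRILM tool): print the words that appear in
# the first vocabulary file but not in the second.
vocab_diff() {
    a=$(mktemp)
    b=$(mktemp)
    # comm needs both inputs sorted in the same collation order
    LC_ALL=C sort "$1" > "$a"
    LC_ALL=C sort "$2" > "$b"
    # -2 drops lines unique to the second file, -3 drops common lines,
    # leaving only the lines unique to the first file
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

Running "vocab_diff 1perbatch.vocab 10perbatch.vocab" would then list the 
words found only in the 1-per-batch vocabulary, which may help show whether 
the missing words share a pattern such as low frequency.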

I am using version 1.5.5.

Thanks,
Ilana

Ilana Heintz
Department of Linguistics
Ohio State University
http://www.ling.ohio-state.edu/~bromberg