[SRILM User List] vocab size from make-batch-counts
Ilana Heintz
heintz.38 at osu.edu
Sun Sep 20 12:03:55 PDT 2009
Hello,
I am wondering what type of n-gram pruning is done in the
make-batch-counts training script, and whether it can be controlled with
flags. I've looked through the code and man pages, but I'm not sure which
argument to pass. I noticed the pruning because the resulting vocabulary
size changes when I vary the batch size. For instance, on a small
development corpus:
> make-batch-counts files.list 10 xmlfilter.sh counts_10perbatch
> merge-batch-counts counts_10perbatch
> ngram-count -read counts_10perbatch/files.list-1.ngrams.gz -write-vocab 10perbatch.vocab
> wc 10perbatch.vocab
2763 2763 32999 10perbatch.vocab
> make-batch-counts files.list 1 xmlfilter.sh counts_1perbatch
> merge-batch-counts counts_1perbatch
> ngram-count -read counts_1perbatch/merge-iter2-1.ngrams.gz -write-vocab 1perbatch.vocab
> wc 1perbatch.vocab
5923 5923 72237 1perbatch.vocab
I get the same sort of result with a larger corpus or other batch sizes:
the vocabulary shrinks as the batch size grows. I have tried varying
-gtmin to change the output, without success. I'm confused as to why
batch size would make a difference here.
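One way to see which words go missing is to compare the two vocabulary
files directly. The following is only a minimal sketch: vocab_diff is a
hypothetical helper name (not part of SRILM), and it assumes each vocab
file holds one word per line. It relies only on the standard sort and comm
utilities.

```shell
# Hypothetical helper (not an SRILM tool): print the words that appear in
# the first vocab file but not in the second.
# Assumption: each file contains one word per line.
vocab_diff() {
    a=$(mktemp) && b=$(mktemp) || return 1
    # comm requires sorted input; LC_ALL=C keeps the ordering consistent
    # between sort and comm.
    LC_ALL=C sort -u "$1" > "$a"
    LC_ALL=C sort -u "$2" > "$b"
    # -23 suppresses lines unique to the second file and lines common to both.
    LC_ALL=C comm -23 "$a" "$b"
    rm -f "$a" "$b"
}
```

With the files from the example above, this could be called as
vocab_diff 1perbatch.vocab 10perbatch.vocab to list the words present only
in the larger vocabulary.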
I am using version 1.5.5.
Thanks,
Ilana
Ilana Heintz
Department of Linguistics
Ohio State University
http://www.ling.ohio-state.edu/~bromberg