[SRILM User List] vocab size from make-batch-counts
Andreas Stolcke
stolcke at speech.sri.com
Mon Sep 21 01:07:36 PDT 2009
Ilana Heintz wrote:
> Hello,
>
> I am wondering about what type of ngram pruning is done in the
> make-batch-counts training script, and if it can be handled with
> flags. I've looked through the code and man pages but I'm not sure
> whether I can pass the right argument. I discovered that the pruning
> happens because, when I vary the batch size, the resulting vocabulary
> size changes. For instance, on a small development corpus:
>
>> make-batch-counts files.list 10 xmlfilter.sh counts_10perbatch
>> merge-batch-counts counts_10perbatch
>> ngram-count -read counts_10perbatch/files.list-1.ngrams.gz -write-vocab
What you are doing does not work as intended. make-batch-counts passes
the -write-vocab option on to ngram-count, but each ngram-count
invocation dumps only the vocabulary of the batch it sees (hence the
result you observed).
To get the combined vocabulary of your data, run

    ngram-count -order 1 -read COUNTS -write-vocab VOCAB

on the final count file.
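
To see why a per-batch vocabulary comes out smaller than the combined one, here is a toy sketch. It does not call SRILM at all: standard Unix tools stand in for the counting step, and the batch file names and the vocab_size helper are invented purely for illustration.

```shell
#!/bin/sh
# Toy illustration only: standard Unix tools stand in for SRILM here.
set -eu
work=$(mktemp -d)
trap 'rm -rf "$work"' EXIT

# Two hypothetical "batches" of training text.
printf 'a b c\n'   > "$work/batch1.txt"
printf 'c d e f\n' > "$work/batch2.txt"

# Count the distinct whitespace-separated tokens in the given files.
vocab_size() {
    cat "$@" | tr -s ' \t' '\n' | sed '/^$/d' | sort -u | wc -l | tr -d ' '
}

# Vocabulary seen by each batch on its own, and by both together.
batch1_vocab=$(vocab_size "$work/batch1.txt")
batch2_vocab=$(vocab_size "$work/batch2.txt")
combined_vocab=$(vocab_size "$work/batch1.txt" "$work/batch2.txt")

echo "batch 1: $batch1_vocab, batch 2: $batch2_vocab, combined: $combined_vocab"
```

This prints "batch 1: 3, batch 2: 4, combined: 6". Each batch sees only part of the word list, so a vocabulary written from one batch will generally be smaller than one derived from all of the data.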
Andreas
> 10perbatch.vocab
>> wc 10perbatch.vocab
> 2763 2763 32999 10perbatch.vocab
>
>> make-batch-counts files.list 1 xmlfilter.sh counts_1perbatch
>> merge-batch-counts counts_1perbatch
>> ngram-count -read counts_1perbatch/merge-iter2-1.ngrams.gz -write-vocab 1perbatch.vocab
>> wc 1perbatch.vocab
> 5923 5923 72237 1perbatch.vocab
>
> Same sort of result when I use a larger corpus or other batch sizes;
> the vocab decreases with an increase in the size of the batch. I have
> tried experimenting with -gtmin to change the output, without
> success. I'm confused as to why batch size would make a difference here.
>
> I am using version 1.5.5.
>
> Thanks,
> Ilana
>
>
> Ilana Heintz
> Department of Linguistics
> Ohio State University
> http://www.ling.ohio-state.edu/~bromberg
>