[SRILM User List] counts in ngram-count output
Andreas Stolcke
stolcke at icsi.berkeley.edu
Fri Jul 20 08:55:42 PDT 2012
On 7/19/2012 6:47 PM, shinichiro.hamada wrote:
> Hi, I have a question about whether my ngram-count output is correct.
>
> I generated a fractional word-count file with my own program and ran
> the ngram-count command with WB discounting. The header of the output
> is shown below:
>
> --------------------------
> [4gram wb float-count]
> ngram-count -read countfile_float -float-counts -order 4 -lm outfile \
> -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3
>
> ngram 1=780387
> ngram 2=20321
> ngram 3=2692
> ngram 4=2622
> ..
> --------------------------
>
> I thought higher-order models always have more counts than lower-order
> ones, but the above result doesn't show that. Does this result
> indicate that my word-count file has a bug?
This is probably because the default minimum counts are higher for
trigrams and 4-grams than for bigrams: 1 for bigrams, but 2 for trigrams
and higher orders. N-grams whose counts fall below the minimum are left
out of the model. You should see the expected behavior if you add
    -gt3min 1 -gt4min 1
to the options. (As explained in the man page, the -gtXmin options apply
to all discounting methods, not just GT.)
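
To see how the cutoffs alone can produce fewer trigrams and 4-grams than bigrams, here is a minimal Python sketch that tallies, per order, how many entries of a count file would survive a given set of minimum counts. It is only an illustration and not SRILM's implementation: the function name surviving_ngrams, the sample data, and the assumed line layout (words, then the count as the last whitespace-separated field) are all hypothetical, as is the fallback minimum of 1 for orders not mentioned above. Other steps in ngram-count may change the totals further.

```python
from collections import Counter

# Minimum counts as described above: 1 for bigrams, 2 for trigrams and 4-grams.
DEFAULT_MINS = {2: 1, 3: 2, 4: 2}
# The same cutoffs after adding -gt3min 1 -gt4min 1.
RELAXED_MINS = {2: 1, 3: 1, 4: 1}


def surviving_ngrams(lines, min_counts, fallback_min=1.0):
    """Tally, per n-gram order, the entries whose count meets that order's minimum.

    Assumption: each line holds the n-gram's words followed by its (possibly
    fractional) count as the last whitespace-separated field.
    """
    kept = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        *words, count = fields
        order = len(words)
        # Orders without an explicit entry use fallback_min (an assumption).
        if float(count) >= min_counts.get(order, fallback_min):
            kept[order] += 1
    return dict(kept)


# Hypothetical fractional counts, for illustration only.
sample = [
    "a\t3.0",
    "b\t2.5",
    "a b\t1.5",
    "b a\t1.0",
    "a b a\t1.5",
    "b a b\t2.0",
    "a b a b\t1.0",
]

print(surviving_ngrams(sample, DEFAULT_MINS))  # {1: 2, 2: 2, 3: 1}
print(surviving_ngrams(sample, RELAXED_MINS))  # {1: 2, 2: 2, 3: 2, 4: 1}
```

Running it on your own count file (for example, surviving_ngrams(open("countfile_float"), DEFAULT_MINS)) gives a rough idea of how many higher-order n-grams fall below the cutoff.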
Andreas