[SRILM User List] counts in ngram-count output

Andreas Stolcke stolcke at icsi.berkeley.edu
Fri Jul 20 08:55:42 PDT 2012


On 7/19/2012 6:47 PM, shinichiro.hamada wrote:
> Hi, I have a question about whether my ngram-count output is correct.
>
> I made a fractional word-count file with my own program and ran
> ngram-count with WB discounting. The header of the output is shown
> below:
>
> --------------------------
> [4gram wb float-count]
> ngram-count -read countfile_float -float-counts -order 4 -lm outfile \
>   -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3
>
> ngram 1=780387
> ngram 2=20321
> ngram 3=2692
> ngram 4=2622
> ..
> --------------------------
>
> I thought higher-order models always have more n-grams than
> lower-order ones, but that is not the case in the result above. Does
> this indicate that my word-count file has a bug?
This is probably because the default minimum count thresholds are 
higher for trigrams and 4-grams than for bigrams.
For bigrams the default is 1, whereas for trigrams and higher orders it 
is 2.  You should see the expected behavior if you add

-gt3min 1 -gt4min 1

to the options.  (As explained in the man page, -gtXmin options apply to 
all discounting methods, not just GT.)
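
To check how much of the drop comes from the minimum-count cutoff, one can tally the count file directly. The following is only a rough sketch: count_by_order is a hypothetical helper, not part of SRILM, and it assumes the count file has one n-gram per line (words separated by spaces, then a tab, then the count) and that the cutoff behaves like a simple "count >= threshold" test. The n-gram totals in the LM header may still differ somewhat from these tallies.

```shell
# Hypothetical helper (not an SRILM tool): for each n-gram order found in a
# count file, print how many n-grams there are in total and how many reach
# a given minimum count.
# Assumed line format: "w1 w2 ... wn<TAB>count" (fractional counts allowed).
count_by_order() {
    # $1 = count file, $2 = minimum count threshold
    awk -F '\t' -v min="$2" '
        {
            # The order is the number of space-separated words before the tab.
            order = split($1, words, " ")
            total[order]++
            if ($2 + 0 >= min) kept[order]++
            if (order > max) max = order
        }
        END {
            for (n = 1; n <= max; n++)
                printf "order %d: total=%d kept=%d\n", n, total[n], kept[n]
        }
    ' "$1"
}
```

For example, "count_by_order countfile_float 2" would show, per order, how many n-grams would survive a threshold of 2; running it again with a threshold of 1 gives a feel for what -gt3min 1 -gt4min 1 lets back in.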

Andreas


