[SRILM User List] counts in ngram-count output
shinichiro.hamada
shinichiro.hamada at gmail.com
Tue Jul 24 11:05:08 PDT 2012
I hadn't understood the specifications of these options.
Thank you very much for pointing it out. I'll try it.
Best regards,
Shinichiro
> -----Original Message-----
> From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu]
> Sent: Saturday, July 21, 2012 12:56 AM
> To: shinichiro.hamada
> Cc: srilm-user at speech.sri.com
> Subject: Re: [SRILM User List] counts in ngram-count output
>
> On 7/19/2012 6:47 PM, shinichiro.hamada wrote:
> > Hi, I have a question about whether my ngram-count output is
> > correct.
> >
> > I made a fractional word-count file with my own program and ran
> > ngram-count with Witten-Bell (wb) discounting. The header of the
> > output is below:
> >
> > --------------------------
> > [4gram wb float-count]
> > ngram-count -read countfile_float -float-counts -order 4 -lm outfile \
> > -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3
> >
> > ngram 1=780387
> > ngram 2=20321
> > ngram 3=2692
> > ngram 4=2622
> > ..
> > --------------------------
> >
> > I thought higher-order models always have more counts than
> > lower-order ones, but that is not the case in the result above.
> > Does this result indicate that my word-count file has a bug?
> This is probably because the default minimum counts are higher
> for trigrams and 4-grams than for bigrams. For bigrams the
> default is 1, whereas for trigrams and higher it is 2, and
> N-grams with counts below the minimum are left out of the model.
> You should see the expected behavior if you add
>
> -gt3min 1 -gt4min 1
>
> to the options. (As explained in the man page, -gtXmin
> options apply to all discounting methods, not just GT.)
>
> Andreas
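
To see why the cutoffs described above can leave fewer trigrams and 4-grams than bigrams, here is a minimal Python sketch. It is not SRILM code: the function name, the toy fractional counts, and the unigram cutoff of 1 are assumptions made for illustration, and it only assumes that an N-gram is kept when its count is at least the minimum for its order.

```python
from collections import Counter

# Minimum counts per order as described in the reply: 1 for bigrams,
# 2 for trigrams and higher. The unigram value of 1 is an assumption.
DEFAULT_MIN_COUNT = {1: 1, 2: 1, 3: 2, 4: 2}

# What adding "-gt3min 1 -gt4min 1" would correspond to in this sketch.
RELAXED_MIN_COUNT = {1: 1, 2: 1, 3: 1, 4: 1}


def retained_per_order(counts, min_count):
    """Return {order: number of N-grams whose count meets the cutoff}."""
    kept = Counter()
    for ngram, count in counts.items():
        order = len(ngram)
        # Assumed rule: keep the N-gram if its count is at least the minimum.
        if count >= min_count[order]:
            kept[order] += 1
    return dict(kept)


# Hypothetical fractional counts, invented for this example only.
toy_counts = {
    ("a",): 3.0, ("b",): 2.5, ("c",): 1.0,
    ("a", "b"): 1.5, ("b", "c"): 1.0, ("c", "a"): 1.0, ("a", "c"): 2.0,
    ("a", "b", "c"): 1.0, ("b", "c", "a"): 1.0, ("c", "a", "b"): 1.5,
    ("a", "c", "a"): 2.0, ("c", "a", "c"): 1.2,
    ("a", "b", "c", "a"): 1.0, ("b", "c", "a", "b"): 1.0,
    ("c", "a", "b", "c"): 1.0, ("a", "c", "a", "c"): 2.5,
    ("c", "a", "c", "a"): 1.0, ("b", "c", "a", "c"): 1.0,
}

default_kept = retained_per_order(toy_counts, DEFAULT_MIN_COUNT)
relaxed_kept = retained_per_order(toy_counts, RELAXED_MIN_COUNT)
```

With the default cutoffs, most of the toy trigrams and 4-grams fall below 2 and are dropped, so the higher orders end up smaller than the bigram set. With all cutoffs at 1, every N-gram in the toy data is kept and the higher orders are the larger ones again.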