[SRILM User List] counts in ngram-count output

shinichiro.hamada shinichiro.hamada at gmail.com
Tue Jul 24 11:05:08 PDT 2012


I had not understood how those options work.
Thank you very much for pointing this out. I'll try it.

Best regards,
Shinichiro

> -----Original Message-----
> From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu] 
> Sent: Saturday, July 21, 2012 12:56 AM
> To: shinichiro.hamada
> Cc: srilm-user at speech.sri.com
> Subject: Re: [SRILM User List] counts in ngram-count output
> 
> On 7/19/2012 6:47 PM, shinichiro.hamada wrote:
> > Hi, I have a question about whether my ngram-count output is correct.
> >
> > I created a fractional word-count file with my own program and ran
> > ngram-count with Witten-Bell (wb) discounting. The header of the output
> > is shown below:
> >
> > --------------------------
> > [4gram wb float-count]
> > ngram-count -read countfile_float -float-counts -order 4 -lm outfile \
> >   -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3
> >
> > ngram 1=780387
> > ngram 2=20321
> > ngram 3=2692
> > ngram 4=2622
> > ..
> > --------------------------
> >
> > I thought higher-order models always have more counts than lower-order
> > ones, but the result above does not show that. Does this result indicate
> > that my word-count file has a bug?
> This is probably because the defaults for minimum count 
> frequency are higher for trigrams and 4grams than for bigrams.
> For bigrams it is 1, whereas for 3grams and higher it is 2.  
> You should see the expected behavior if you add
> 
> -gt3min 1 -gt4min 1
> 
> to the options.  (As explained in the man page, -gtXmin 
> options apply to all discounting methods, not just GT.)
> 
> Andreas
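
To illustrate the cutoff behavior described in the reply, here is a minimal
shell sketch. The counting step does not call SRILM: the count_kept helper
and the toy counts are hypothetical, and the sketch only mimics the rule
that an N-gram of order n is kept when its count is at least the -gtNmin
value (defaults as stated above: 1 for unigrams and bigrams, 2 for higher
orders). The real header counts may differ for reasons this sketch ignores.
The final ngram-count call repeats the original command with the two
suggested options and runs only if SRILM and the count file are present.

```shell
#!/bin/sh
# Toy fractional count file (hypothetical data), one N-gram per line:
# words separated by spaces, then a tab, then the count.
toy=$(mktemp)
printf '%s\t%s\n' \
  "a" 3.0 "b" 2.5 \
  "a b" 1.5 "b a" 1.0 \
  "a b a" 1.5 "b a b" 1.0 \
  "a b a b" 1.2 "b a b a" 2.0 > "$toy"

# count_kept FILE MIN1 MIN2 MIN3 MIN4
# Hypothetical helper: prints, per order, how many N-grams have a count
# at least as large as the minimum for that order.
count_kept() {
  awk -F '\t' -v m1="$2" -v m2="$3" -v m3="$4" -v m4="$5" '
    BEGIN { min[1] = m1; min[2] = m2; min[3] = m3; min[4] = m4 }
    { n = split($1, w, " "); if ($2 + 0 >= min[n]) kept[n]++ }
    END { for (i = 1; i <= 4; i++) printf "ngram %d=%d\n", i, kept[i] }
  ' "$1"
}

# Default minimums: trigrams and 4-grams with counts below 2 drop out.
count_kept "$toy" 1 1 2 2
# ngram 1=2
# ngram 2=2
# ngram 3=0
# ngram 4=1

# Same data with the minimums lowered, as -gt3min 1 -gt4min 1 would do.
count_kept "$toy" 1 1 1 1
# ngram 1=2
# ngram 2=2
# ngram 3=2
# ngram 4=2

# The original command with the suggested options added.
if command -v ngram-count >/dev/null 2>&1 && [ -f countfile_float ]; then
  ngram-count -read countfile_float -float-counts -order 4 -lm outfile \
    -wbdiscount -wbdiscount1 -wbdiscount2 -wbdiscount3 \
    -gt3min 1 -gt4min 1
fi
```

With fractional counts, many higher-order N-grams can have counts between 1
and 2, which is why lowering the minimums can change the totals so much.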


