[SRILM User List] LM whose counts are multiplied

Thu Feb 2 06:10:02 PST 2012

Dear Mr. Stolcke, 

Thank you for your clear explanation.
I understood it completely!!
I'll try the WB discounting method with the -float-counts option.

Shinichiro Hamada

> -----Original Message-----
> From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu] 
> Sent: Thursday, February 02, 2012 5:52 AM
> To: shinichiro.hamada
> Cc: srilm-user at speech.sri.com
> Subject: Re: [SRILM User List] LM whose counts are multiplied
> 
> On 2/1/2012 7:01 AM, shinichiro.hamada wrote:
> > Hello, all.
> >
> > I want to make a language model with data which have fractional counts.
> > But not all smoothing methods can handle them, so I'll try to multiply
> > each count by 10 and make it an integer by rounding.
> >
> > --
> > I did a preliminary experiment.
> >
> > Files:
> > * count-file with integers : a.count
> > * the file whose counts are multiplied by 10 : b.count
> >
> > Command:
> > ngram-count -read a.count -order 3 -lm a.lm -wbdiscount -wbdiscount1 \
> >     -wbdiscount2 -wbdiscount3 -interpolate
> > ngram-count -read b.count -order 3 -lm b.lm -wbdiscount -wbdiscount1 \
> >     -wbdiscount2 -wbdiscount3 -interpolate
> >
> > I expected the same language models to be generated, but they differed.
> > Why? The following are their heading parts.
> 
> First off, the WB discounting method does support fractional
> counts, so you can just feed your counts to ngram-count -float-counts ...
> with no need to scale and truncate the counts to integers.
> 
> The reason you are seeing different LM outputs for different 
> count multipliers is that smoothing is sensitive to the 
> absolute occurrence counts of ngrams, not just their relative 
> frequencies.  This has to be so, if you're trying to estimate 
> the probabilities of unseen ngrams.  If you've seen only 10
> cases of "a b" and never saw "a b x", you should be less
> surprised to see your first "a b x" than if you had seen
> 1000 instances of "a b" (and still none of "a b x").
> 
> Andreas
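
A rough numeric sketch of the point above, for anyone reading this thread
later. It assumes the textbook Witten-Bell estimate, in which a context with
total count N and T distinct following word types reserves T / (N + T) of its
probability mass for words not yet seen after that context. SRILM's actual
implementation may differ in details, and the function name and toy counts
below are made up for illustration.

```python
def wb_unseen_mass(follower_counts):
    """Textbook Witten-Bell mass reserved for words unseen after a context.

    follower_counts maps each word seen after the context to its count,
    which may be fractional. This is an illustrative helper, not SRILM code.
    """
    total = sum(follower_counts.values())                      # N
    types = sum(1 for c in follower_counts.values() if c > 0)  # T
    return types / (total + types)


# Hypothetical counts of words following the context "a b".
counts = {"c": 6.0, "d": 4.0}
# The same data with every count multiplied by 10.
scaled = {word: 10 * c for word, c in counts.items()}

print(wb_unseen_mass(counts))   # T / (N + T) with N = 10,  T = 2
print(wb_unseen_mass(scaled))   # T / (N + T) with N = 100, T = 2
```

Multiplying every count by 10 raises N but leaves T unchanged, so the mass
reserved for unseen words shrinks. Under this assumption the relative
frequencies in a.count and b.count are identical, yet the smoothed models
are not.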