[SRILM User List] LM whose counts are multiplied
shinichiro.hamada at gmail.com
Thu Feb 2 06:10:02 PST 2012
Dear Mr. Stolcke,
Thank you for your clear explanation.
I understood it completely!!
I'll try the WB discounting method with -float-counts.
> -----Original Message-----
> From: Andreas Stolcke [mailto:stolcke at icsi.berkeley.edu]
> Sent: Thursday, February 02, 2012 5:52 AM
> To: shinichiro.hamada
> Cc: srilm-user at speech.sri.com
> Subject: Re: [SRILM User List] LM whose counts are multiplied
> On 2/1/2012 7:01 AM, shinichiro.hamada wrote:
> > Hello, all.
> > I want to build a language model from data that have fractional counts.
> > But not all smoothing methods can handle them, so I'll try to multiply
> > each count by 10 and round it to an integer.
> > --
> > I did a preliminary experiment.
> > Files:
> > * count-file with integers : a.count
> > * the file whose counts are multiplied by 10 : b.count
> > Command:
> > ngram-count -read a.count -order 3 -lm a.lm -wbdiscount -wbdiscount1 \
> >     -wbdiscount2 -wbdiscount3 -interpolate
> > ngram-count -read b.count -order 3 -lm b.lm -wbdiscount -wbdiscount1 \
> >     -wbdiscount2 -wbdiscount3 -interpolate
> > I expected the same language models to be generated, but they
> > differed. Why?
> > The following are their heading parts.
> First off, the WB discounting method does support fractional
> counts, so you can just feed your counts to ngram-count -float-counts ...
> with no need to scale and truncate the counts to integers.
> The reason you are seeing different LM outputs for different
> count multipliers is that smoothing is sensitive to the
> absolute occurrence counts of ngrams, not just their relative
> frequencies. This has to be so, if you're trying to estimate
> the probabilities of unseen ngrams. If you've seen only 10
> cases of "a b" and never saw "a b x", you should be less
> surprised to see your first "a b x" than if you had seen
> 1000 instances of "a b" (and still none of "a b x").
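
To make the point about absolute counts concrete, here is a minimal sketch of the textbook Witten-Bell estimate, in which the mass reserved for unseen words after a context is T / (C + T), with C the total count of the context and T the number of distinct words seen after it. This is only an illustration under that assumption, not SRILM's actual code; the function name and the toy counts are made up.

```python
def witten_bell_unseen_mass(followers):
    """Mass the textbook Witten-Bell estimate reserves for unseen words.

    `followers` maps each word seen after a context to its count, which
    may be fractional. The reserved mass is T / (C + T), where C is the
    total count and T is the number of distinct follower types.
    """
    total = sum(followers.values())   # C: how often the context was seen
    types = len(followers)            # T: distinct words seen after it
    return types / (total + types)


# Hypothetical counts for words following the context "a b".
counts = {"c": 6.0, "d": 3.0, "e": 1.0}            # context seen 10 times
scaled = {w: 10 * n for w, n in counts.items()}    # same data, counts x 10

print(witten_bell_unseen_mass(counts))   # 3 / 13, about 0.23
print(witten_bell_unseen_mass(scaled))   # 3 / 103, about 0.03
```

Multiplying every count by 10 leaves T unchanged but makes C ten times larger, so much less probability is held back for unseen ngrams and the resulting models differ.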