Andreas Stolcke stolcke at icsi.berkeley.edu
Wed Feb 1 12:51:40 PST 2012

On 2/1/2012 7:01 AM, shinichiro.hamada wrote:
> Hello, all.
> I want to build a language model from data that has fractional counts. But
> not all smoothing methods can handle them, so I'll try multiplying each count
> by 10 and rounding it to an integer.
> --
> I did a preliminary experiment.
> Files:
> * count-file with integers : a.count
> * the file whose counts are multiplied by 10 : b.count
> Command:
> ngram-count -read a.count -order 3 -lm a.lm -wbdiscount -wbdiscount1
> -wbdiscount2 -wbdiscount3 -interpolate
> ngram-count -read b.count -order 3 -lm b.lm -wbdiscount -wbdiscount1
> -wbdiscount2 -wbdiscount3 -interpolate
> I expected the same language model to be generated, but they differed. Why?
> The following are their heading parts.

First off, the WB discounting method does support fractional counts, so 
you can just feed your counts to
ngram-count -float-counts ...
with no need to scale and round the counts to integers.

The reason you are seeing different LM outputs for different count 
multipliers is that smoothing is sensitive to the absolute occurrence 
counts of ngrams, not just their relative frequencies.  This has to be 
so if you're trying to estimate the probabilities of unseen ngrams.  If 
you've seen only 10 cases of "a b" and never saw "a b x", you should be 
less surprised to see your first "a b x" than if you had seen 1000 
instances of "a b" (and still none of "a b x").
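
To make this concrete, here is a small sketch of the textbook 
Witten-Bell estimate of the probability mass reserved for unseen words 
after a context: the number of distinct word types seen after the 
context, divided by the context count plus that number of types.  This 
is only an illustration, not SRILM's actual code; the function name and 
the counts for "a b" are made up for the example.

```python
def wb_unseen_mass(followers):
    """Textbook Witten-Bell mass reserved for unseen words after a context.

    followers maps each word seen after the context to its count.
    This is an illustrative sketch, not SRILM's implementation.
    """
    total = sum(followers.values())  # how often the context was seen
    types = len(followers)           # distinct words seen after it
    return types / (total + types)


# Hypothetical counts of words following the context "a b".
counts = {"c": 6, "d": 3, "e": 1}
# The same counts multiplied by 10, as in b.count.
scaled = {word: 10 * count for word, count in counts.items()}

print(wb_unseen_mass(counts))  # 3 / (10 + 3)
print(wb_unseen_mass(scaled))  # 3 / (100 + 3)
```

Multiplying the counts by 10 leaves the number of distinct types at 3 
but raises the context count from 10 to 100, so the mass given to unseen 
words shrinks and the two LMs cannot come out identical.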


