[SRILM User List] LM whose counts are multiplied
Andreas Stolcke
stolcke at icsi.berkeley.edu
Wed Feb 1 12:51:40 PST 2012
On 2/1/2012 7:01 AM, shinichiro.hamada wrote:
> Hello, all.
>
> I want to build a language model from data that has fractional counts. But
> not all smoothing methods can handle them, so I'll try multiplying each count
> by 10 and rounding it to an integer.
>
> --
> I did a preliminary experiment.
>
> Files:
> * count-file with integers : a.count
> * the file whose counts are multiplied by 10 : b.count
>
> Command:
> ngram-count -read a.count -order 3 -lm a.lm -wbdiscount -wbdiscount1
> -wbdiscount2 -wbdiscount3 -interpolate
> ngram-count -read b.count -order 3 -lm b.lm -wbdiscount -wbdiscount1
> -wbdiscount2 -wbdiscount3 -interpolate
>
> I expected the same language models to be generated, but they differed. Why?
> Their heading parts follow.
First off, the WB discounting method does support fractional counts, so
you can just feed your counts to
ngram-count -float-counts ...
with no need to scale and round the counts to integers.
The reason you are seeing different LM outputs for different count
multipliers is that smoothing is sensitive to the absolute occurrence
counts of ngrams, not just their relative frequencies. This has to be
so if you're trying to estimate the probabilities of unseen ngrams. If
you've seen only 10 instances of "a b" and never saw "a b x", you should
be less surprised to see your first "a b x" than if you had seen 1000
instances of "a b" (and still none of "a b x").
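
To make the effect concrete, here is a minimal sketch of the textbook
Witten-Bell estimate of the probability mass reserved for unseen words
after a context: T / (N + T), where N is the total count observed after
the context and T is the number of distinct words seen after it. This is
an illustration only, not SRILM's actual code; the function name and the
counts for the context "a b" are hypothetical. Multiplying every count
by 10 scales N but leaves T unchanged, so the reserved mass shrinks even
though the relative frequencies are identical.

```python
def witten_bell_unseen_mass(follower_counts):
    """Mass reserved for words never seen after a context (textbook Witten-Bell)."""
    # N: total (possibly fractional) count of tokens observed after the context
    total = sum(follower_counts.values())
    # T: number of distinct word types observed after the context
    types = sum(1 for c in follower_counts.values() if c > 0)
    return types / (total + types)


# Hypothetical counts of the words that follow the context "a b".
counts = {"c": 6, "d": 3, "e": 1}
# The same counts multiplied by 10: relative frequencies are unchanged.
scaled = {word: 10 * c for word, c in counts.items()}

print(witten_bell_unseen_mass(counts))  # 3 / (10 + 3)
print(witten_bell_unseen_mass(scaled))  # 3 / (100 + 3)
```

The second value is much smaller than the first, which is why a.lm and
b.lm assign different probabilities and backoff weights.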
Andreas