[SRILM User List] Count-lm reference request
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Oct 1 10:24:41 PDT 2013
On 9/30/2013 10:46 PM, E wrote:
> Hello,
>
> I'm trying to understand the meaning of the "google.count.lm0" file
> given in the FAQ section on creating an LM from the Web1T corpus.
> From what I read in Sec. 11.4.1, Deleted Interpolation Smoothing, in
> Spoken Language Processing by Huang et al., the bigram case is
> (equation 11.22):
>
> P(w_i | w_{i-1}) = \lambda * P_{MLE}(w_i | w_{i-1}) + (1 - \lambda) * P(w_i)
>
> They call the \lambda's mixture weights. I wonder if they are
> conceptually the same as the ones used in google.countlm. If so, why
> are they arranged in a 15x5 matrix? Where can I read more about this?
I don't have access to the book chapter you cite, but from the equation
it looks like a single fixed interpolation weight is used.
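(With made-up numbers: a fixed \lambda = 0.7 with
P_{MLE}(w_i | w_{i-1}) = 0.2 and P(w_i) = 0.05 would give
0.7 * 0.2 + 0.3 * 0.05 = 0.155 for every bigram, no matter how often
the context w_{i-1} was observed.)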
In the SRILM count-lm implementation you have separate lambdas assigned
to different groups of context ngrams, as a function of the frequency of
those contexts. This is what is called "Jelinek-Mercer" smoothing in
http://acl.ldc.upenn.edu/P/P96/P96-1041.pdf , where the bucketing of the
contexts is done based on frequency (as suggested in the paper). The
specifics are spelled out in the ngram(1) man page. The relevant bits are:
mixweights M
w01 w02 ... w0N
w11 w12 ... w1N
...
wM1 wM2 ... wMN
countmodulus m
M specifies the number of mixture weight bins (minus 1), and m is
the width of a mixture weight bin. Thus, wij is the mixture weight
used to interpolate a j-th order maximum-likelihood estimate with
lower-order estimates, given that the (j-1)-gram context has been
seen with a frequency between i*m and (i+1)*m-1 times. (For
contexts with frequency greater than M*m, the i=M weights are
used.)
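In case it is useful, here is a small Python sketch of the lookup and
interpolation as I understand it from the man page text above. This is
not SRILM code; M, m, the weight matrix, the probabilities, and the
vocabulary size are all made-up placeholders.

M = 2    # number of mixture weight bins minus 1 ("mixweights M")
m = 40   # width of a mixture weight bin ("countmodulus m")
N = 3    # ngram order of the model

# (M+1) x N matrix: row i holds the weights for contexts seen between
# i*m and (i+1)*m - 1 times; column j-1 is the weight on the j-th
# order maximum-likelihood estimate.
mixweights = [
    [0.1, 0.2, 0.3],   # rare contexts: lean on lower-order estimates
    [0.3, 0.5, 0.6],
    [0.5, 0.7, 0.8],   # frequent contexts: trust the ML estimate more
]

def mixture_weight(context_count, order):
    """Look up wij for a j-th order estimate whose (j-1)-gram context
    was seen context_count times."""
    i = min(context_count // m, M)   # frequency bucket, capped at i=M
    return mixweights[i][order - 1]

def interpolated_prob(p_ml, context_counts, vocab_size):
    """Jelinek-Mercer interpolation of ML estimates of orders 1..N:
    P_j = w * P_ml_j + (1 - w) * P_{j-1}, starting from a uniform
    order-0 distribution 1/vocab_size.  p_ml[j-1] is the j-th order
    ML estimate; context_counts[j-1] is the frequency of its
    (j-1)-gram context."""
    p = 1.0 / vocab_size
    for j in range(1, N + 1):
        w = mixture_weight(context_counts[j - 1], j)
        p = w * p_ml[j - 1] + (1.0 - w) * p
    return p

# E.g. a trigram whose bigram context was seen 95 times and whose
# unigram context was seen 4000 times; the order-1 estimate's context
# is the empty 0-gram, whose "count" is the total number of tokens:
print(interpolated_prob(p_ml=[0.01, 0.05, 0.2],
                        context_counts=[10**6, 4000, 95],
                        vocab_size=10**5))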
Andreas