[SRILM User List] NgramCountLM Bug?

Fri Feb 24 19:42:40 PST 2012

On 2/24/2012 2:35 PM, Ariya Rastrow wrote:
>
> Hi,
>   I have a question about NgramCountLM (the Jelinek-Mercer interpolation
> method). There seems to be a bug in the way the \lambda parameters are
> estimated in the code. The EM expectations for the \lambda's are
> collected by iterating through the N-grams of the held-out text, but
> the count of each N-gram is not taken into account when
> NgramCountLM::estimate() calls LM::countsProb(...), even though
> wordProb is multiplied by that count when the log-probability of the
> held-out text is calculated. In other words, the statistics for the
> \lambda's are collected as if each event were a singleton in the
> held-out data. The fix would be to pass *count from
> LM::countsProb(...) to NgramCountLM::wordProbTrain(...) so that the
> posteriors of the \lambda's get multiplied by that count.
>
Good catch! That is indeed a bug. Attached is a patch that should do
the right thing.

Andreas
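
For illustration only: the sketch below is not the attached patch and does
not use SRILM's API. It is a minimal Python toy, assuming a single
interpolation weight \lambda between two fixed distributions, that shows
why the EM posteriors need to be multiplied by the held-out N-gram count.
The function name em_lambda, the (p_high, p_low, count) tuple layout, and
the example numbers are all hypothetical.

```python
def em_lambda(heldout, weight_by_count=True, lam=0.5, iterations=50):
    """Estimate one Jelinek-Mercer weight by EM on held-out data.

    heldout: list of (p_high, p_low, count) tuples, one per distinct
             held-out event, where the interpolated probability is
             lam * p_high + (1 - lam) * p_low.
    weight_by_count: if False, every distinct event is treated as a
             singleton, which mimics the behaviour described in the report.
    """
    for _ in range(iterations):
        expected = 0.0  # expected number of events explained by p_high
        total = 0.0     # total (weighted) number of events
        for p_high, p_low, count in heldout:
            # E-step: posterior that this event came from the p_high component
            posterior = lam * p_high / (lam * p_high + (1.0 - lam) * p_low)
            weight = count if weight_by_count else 1
            expected += weight * posterior
            total += weight
        # M-step: the new weight is the normalized expected count
        lam = expected / total
    return lam


# Hypothetical held-out statistics: one frequent event, one rare event.
heldout = [(0.9, 0.1, 10), (0.05, 0.2, 1)]
lam_weighted = em_lambda(heldout, weight_by_count=True)
lam_singleton = em_lambda(heldout, weight_by_count=False)
```

With count weighting, the result matches what EM would give if every
held-out token were visited individually; without it, frequent and rare
N-grams contribute equally and the estimate can shift noticeably.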


-------------- next part --------------
An embedded and charset-unspecified text was scrubbed...
Name: ngramcountlm.patch
URL: <http://www.speech.sri.com/pipermail/srilm-user/attachments/20120224/37f59863/attachment.ksh>