[SRILM User List] NgramCountLM Bug?
stolcke at icsi.berkeley.edu
Fri Feb 24 19:42:40 PST 2012
On 2/24/2012 2:35 PM, Ariya Rastrow wrote:
> I had a question about NgramCountLM (Jelinek-Mercer interpolation
> method). It seems to me there is a bug with the way the \lambda
> parameters are being estimated in the code. The problem is that the
> expectations for \lambda's (using EM) are being collected by iterating
> through N-grams of the held-out text. However, the count of each N-gram
> is not being taken into account (even though, for calculating the
> log-probability of the held-out text, the wordProb is being
> multiplied by the count of the N-gram) during the call
> to LM::countsProb(...) by NgramCountLM::estimate(). In other words,
> the statistics for \lambda's are being collected as if each event is a
> singleton in the held-out data. The fix to this would be to pass
> *count from LM::countsProb(...) to NgramCountLM::wordProbTrain(...)
> such that the posteriors of \lambda get multiplied by that count.
Good catch! That is indeed a bug. Attached is a patch that should do
the right thing.
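
For readers of the archive, the effect described above can be illustrated
with a small sketch. This is not the SRILM code and not the attached patch;
it is a simplified two-way Jelinek-Mercer interpolation written in Python,
and the function name em_lambda_step, the use_counts flag, and the toy
probabilities are all hypothetical. It only shows why the posterior of
\lambda should be multiplied by the held-out count of each N-gram.

```python
def em_lambda_step(events, lam, use_counts=True):
    """One EM update for a single interpolation weight (illustrative only).

    events: list of (p_high, p_low, count) tuples, one per distinct
            held-out N-gram; p_high and p_low are the higher- and
            lower-order estimates, count is the held-out frequency.
    lam:    current weight on the higher-order estimate.
    """
    posterior_sum = 0.0
    total = 0.0
    for p_high, p_low, count in events:
        # Ignoring the count treats every distinct N-gram as a singleton.
        weight = count if use_counts else 1
        mix = lam * p_high + (1.0 - lam) * p_low
        # E-step: posterior that the higher-order component produced the word.
        posterior = lam * p_high / mix
        posterior_sum += weight * posterior
        total += weight
    # M-step: the new weight is the (count-weighted) average posterior.
    return posterior_sum / total


# Toy held-out data (made-up numbers): one frequent and one rare N-gram.
events = [(0.5, 0.1, 9), (0.01, 0.2, 1)]
lam_weighted = em_lambda_step(events, 0.5, use_counts=True)
lam_unweighted = em_lambda_step(events, 0.5, use_counts=False)
```

With the counts applied, the update matches what one would get by iterating
over every held-out token individually; without them, the rare N-gram
influences the new weight as much as the frequent one.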