[SRILM User List] NgramCountLM Bug?

Ariya Rastrow ariya at jhu.edu
Fri Feb 24 14:35:10 PST 2012

  I have a question about NgramCountLM (the Jelinek-Mercer interpolation
method). It seems to me there is a bug in the way the \lambda parameters
are estimated in the code. The expectations for the \lambda's (using EM)
are collected by iterating over the N-grams of the held-out text. However,
the count of each N-gram is not taken into account when
NgramCountLM::estimate() calls LM::countsProb(...), even though the
wordProb is multiplied by that count when the log-probability of the
held-out text is computed. In other words, the statistics for the
\lambda's are collected as if each event were a singleton in the held-out
data. The fix would be to pass *count from LM::countsProb(...) to
NgramCountLM::wordProbTrain(...) so that the posteriors of the \lambda's
are multiplied by that count.
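
To illustrate the point, here is a minimal sketch of one EM update for a
single interpolation weight, with and without count weighting. It is not
the SRILM code: the function name em_lambda_step, the use_counts flag, and
the (p_ml, p_lower, count) tuples are hypothetical, and the model is
simplified to one \lambda mixing a maximum-likelihood estimate with a
lower-order estimate.

```python
def em_lambda_step(heldout, lam, use_counts=True):
    """One EM update for lam in p(w|h) = lam * p_ml + (1 - lam) * p_lower.

    heldout: list of (p_ml, p_lower, count) tuples, one per distinct
             held-out N-gram (hypothetical layout, for illustration only).
    use_counts: if False, every N-gram is treated as a singleton, which is
                the behaviour described above as the suspected bug.
    """
    posterior_sum = 0.0
    total_weight = 0.0
    for p_ml, p_lower, count in heldout:
        weight = count if use_counts else 1
        mixed = lam * p_ml + (1.0 - lam) * p_lower
        if mixed <= 0.0:
            continue  # event has zero probability; it carries no evidence
        # E-step: posterior that this event was generated by the p_ml component
        posterior = lam * p_ml / mixed
        posterior_sum += weight * posterior
        total_weight += weight
    # M-step: new lam is the (count-weighted) average posterior
    return posterior_sum / total_weight if total_weight else lam


# An N-gram seen 10 times and one seen once in the held-out data.
heldout = [(0.5, 0.1, 10), (0.0, 0.2, 1)]
weighted = em_lambda_step(heldout, 0.5, use_counts=True)
unweighted = em_lambda_step(heldout, 0.5, use_counts=False)
```

With counts, the update matches what one would get by listing every
held-out token separately; without them, the frequent N-gram has no more
influence on \lambda than the rare one.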


