[SRILM User List] A problem with expanding class-based LMs

Thu Dec 22 15:19:49 PST 2011

The problem turns out to be that the backoff computation is sensitive to
whether a sum of probabilities is exactly zero or only numerically equal
to zero (less than Prob_Epsilon).  In your case, the sum of unigram
probs of the expanded LM is sometimes very slightly less than 1, causing
some probability mass to be distributed over all the unseen words, and
the perplexity to change noticeably.

The patch below will catch these cases and produce consistent results
independent of these small numerical differences (which result from
probabilities being summed in a different order, depending on whether the
iteration is over sorted arrays or hash tables).

Andreas

diff -c -r1.122 NgramLM.cc
*** lm/src/NgramLM.cc	30 May 2011 23:46:38 -0000	1.122
--- lm/src/NgramLM.cc	22 Dec 2011 22:27:58 -0000
***************
*** 2118,2125 ****
  	     * unigrams, which we achieve by giving them zero probability.
  	     */
  	    if (order == 0 /*&& numerator > 0.0*/) {
  		distributeProb(numerator, context);
! 	    } else if (numerator == 0.0 && denominator == 0.0) {
  		node->bow = LogP_One;
  	    } else {
  		node->bow = ProbToLogP(numerator) - ProbToLogP(denominator);
--- 2118,2131 ----
  	     * unigrams, which we achieve by giving them zero probability.
  	     */
  	    if (order == 0 /*&& numerator > 0.0*/) {
+ 		if (numerator < Prob_Epsilon) {
+ 		    /*
+ 		     * Avoid spurious non-zero unigram probabilities
+ 		     */
+ 		    numerator = 0.0;
+ 		}
  		distributeProb(numerator, context);
! 	    } else if (numerator < Prob_Epsilon && denominator < Prob_Epsilon) {
  		node->bow = LogP_One;
  	    } else {
  		node->bow = ProbToLogP(numerator) - ProbToLogP(denominator);
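
For readers who do not want to apply the diff, the change in behavior can be
sketched in isolation. This is only a simplified stand-in for the patched
branch: the function names are invented here, kProbEpsilon is an assumed
placeholder value rather than a quote of the SRILM constant, and log10 is
used in place of ProbToLogP.

```cpp
#include <cmath>

// Assumed placeholder values for illustration only.
const double kProbEpsilon = 3e-06;  // stands in for Prob_Epsilon
const double kLogPOne     = 0.0;    // log of probability 1

// Old test: only exact zeros are treated as "no mass left".
double backoffWeightExact(double numerator, double denominator)
{
    if (numerator == 0.0 && denominator == 0.0) {
        return kLogPOne;
    }
    return std::log10(numerator) - std::log10(denominator);
}

// New test: anything below the epsilon counts as zero.
double backoffWeightTolerant(double numerator, double denominator)
{
    if (numerator < kProbEpsilon && denominator < kProbEpsilon) {
        return kLogPOne;
    }
    return std::log10(numerator) - std::log10(denominator);
}

// Unigram case: discard spurious leftover mass before distributing it.
double clampLeftoverMass(double numerator)
{
    return (numerator < kProbEpsilon) ? 0.0 : numerator;
}
```

With a numerator of 1e-16 and a denominator of 2e-16 (rounding noise only),
the exact test yields a non-zero backoff weight, while the tolerant test
yields the same result as for two exact zeros.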

In message <4EF38A13.7020309 at ovgu.de>you wrote:
> 
> I repeated the expansion with different binaries and got different 
> results again.
> I attached the source files and corresponding scripts to this e-mail. I 
> did not include the expanded models since they are too large, but they 
> are also available.
> 
> I hope this will help you to investigate the problem.
> 
> Sincerely yours,
> Dmytro Prylipko.
> 
> Am 12/22/2011 7:38 PM, schrieb Andreas Stolcke:
> > Dmytro Prylipko wrote:
> >> I also tried expansion on trigrams, with the same problem.
> >> Actually I managed to work around it. I compiled SRILM with the "_c" 
> >> option and expanded my bigrams with that binary. It helped 
> >> (perplexity measures became the same), however in this case other 
> >> bigrams (expanded ok with the usual binary) had the problem described 
> >> before. Is it a bug?
> > You should never get different results (other than sorting order, 
> > e.g., in counts files) with the regular and the _c version.
> > Can you send me the inputs involved?
> >
> > Andreas
> >