question about warning message

Andreas Stolcke stolcke at speech.sri.com
Wed May 23 20:28:29 PDT 2001


Sarah,

This discrepancy was indeed caused by the different floating-point precision on
x86 machines.  To check for an anomaly of the counts-of-counts in Good-Turing
discounting, the code was checking whether two numbers were the same.  This test
turned out true on the Sparc machine, but false on Intel-based CPUs (the values
were ever-so-slightly off due to the extra bits in x86 floating-point registers).
The patch below fixes this problem and makes the behavior consistent (apply it to
Discount.cc and rebuild the Linux version).  It is really annoying that Intel
couldn't just implement standard-precision IEEE arithmetic...
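
To illustrate why a bit-for-bit comparison is fragile here, the following is a
minimal sketch (not code from SRILM; the function names and the idea of passing
an explicit tolerance are assumptions made for illustration).  It contrasts an
exact test with one that treats values within a small tolerance as equal:

```cpp
#include <cmath>

// Exact test: true only if a is less than or equal to b bit-for-bit.
// Two values that are equal on paper can fail this test when one of them
// was rounded differently (e.g. kept in a wider register).
bool atOrBelowExact(double a, double b)
{
    return a <= b;
}

// Tolerant test: also true when a exceeds b by no more than eps.
// eps is a caller-chosen tolerance (hypothetical parameter).
bool atOrBelowWithin(double a, double b, double eps)
{
    return a - b <= eps;
}
```

With a value one rounding step above 0.900585, the exact test reports "greater"
while the tolerant test still reports "at or below", so the outcome no longer
depends on the last bit of the result.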

Beyond that, however, you should use a higher threshold for unigram discounting
to avoid the problem of anomalous (non-smooth) counts-of-counts in the first place.
Try "-gt1min 5".

*** /tmp/T00vP_Q5	Wed May 23 20:18:39 2001
--- Discount.cc	Wed May 23 20:02:53 2001
***************
*** 185,197 ****
  	    } else {
  		double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
  					    (i * (double)countOfCounts[i]);
! 		if (coeff0 <= commonTerm || coeff0 > 1.0) {
  		    cerr << "warning: discount coeff " << i
! 			 << " is out of range: " << coeff0 << "\n";
  		    coeff = 1.0;
- 		} else {
- 		    coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
- 
  		}
  	    }
  	    discountCoeffs[i] = coeff;
--- 185,195 ----
  	    } else {
  		double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
  					    (i * (double)countOfCounts[i]);
! 		coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
! 		if (coeff <= Prob_Epsilon || coeff0 > 1.0) {
  		    cerr << "warning: discount coeff " << i
! 			 << " is out of range: " << coeff << "\n";
  		    coeff = 1.0;
  		}
  	    }
  	    discountCoeffs[i] = coeff;

--Andreas

In message <Pine.LNX.4.21.0105231331160.3828-100000 at titanium.cs.washington.edu> you wrote:
> hi all,
> 
> I am running SRILM 1.0.1 on two different platforms (linux and
> solaris) and got different results using the same data with exactly the
> same commands.  I'm hoping that someone else might have some insight...  
> 
> I'm not doing anything fancy - in this case, I just used ngram-count to
> build a trigram lm using the default settings for GT discounting, etc.  
> Still, I get noticeably different results ( ppl= 18.0975 ppl1= 40.7525 in
> linux and  ppl= 17.2411 ppl1= 38.3 in solaris )
> 
> The solaris version gives the following warning, but the linux version
> does not:
>  warning: discount coeff 1 is out of range: 0.900585
> 
> I turned on the -debug 3 flag to get more information, and the output of
> the two versions is nearly identical.  The differences are the warning
> above; also, one version discards 1 1-gram prob predicting a pseudo-event
> while the other discards 2, and in the end, they have very different
> left-over probability masses ( 0.00388768 vs.  4.55956e-06, where the
> second number corresponds with the warning I quoted above )
> although they distribute these over the same number of
> unseen events and write the same number of n-grams.  The GT-count numbers
> are also all the same in both versions.
> 
> I found the warning message in the code (in lm/src/Discount.cc) but I
> don't really understand what's causing it, and I certainly don't
> understand why I get it on one installation and not the other.  If anyone
> has any insight to offer, I'd greatly appreciate it. 
> 
> thanks much,
> Sarah
> 
> ________________________
> Sarah Schwarm
> sarahs at cs.washington.edu
> 



