SRILM nan probabilities in language models.

Tue Dec 9 13:04:43 PST 2003

In message <!~!UENERkVCMDkAAQACAAAAAAAAAAAAAAAAABgAAAAAAAAArYwhjGp41hGnTADQt1Ft
L8KAAAAQAAAAcuoCVreIRkKbqeqyuchROAEAAAAA at cs.technion.ac.il>you wrote:
> Dear Mr. Stolcke,
> 
> I'm a CS M.Sc. student in the Technion, Israel. I've been using SRILM
> during the last few months for tagging Hebrew. 
> 
> I use ngram-count for creating language models. I tried it on 5
> randomly created test sets.
> In 4 out of 5, the language model was created successfully, although I
> got warnings such as:
> 
> warning: discount coeff X is out of range: Y (two warnings or so for
> each file).
> 
> But for one set, all the unigrams get a "nan" probability (the
> bigram probabilities seem OK), and of course disambig performs poorly
> with this language model.
> I have no idea what the difference is between this text file and the others.
> 
> 
> The command line I used:
> 
> ngram-count -order 2 -text train.tagseq -lm train.lm.bigram
> 
> I attached the input text file, the output language model, and the debug
> messages I got (with -debug 1)
> 
> I would be very grateful if you could help me find out what the problem
> is and how I can solve it.

This is a bug in the GT (Good-Turing) discounting method that was fixed recently.
A quick patch is included below.  It will also be fixed in the next release.

(You will have to apply this patch by hand since the RCS IDs differ from 
your version.)

--Andreas 

*** /tmp/T00Q9yJ8	Tue Dec  9 13:01:25 2003
--- Discount.cc	Tue Nov 11 11:35:29 2003
***************
*** 5,14 ****
   */

  #ifndef lint
! static char Copyright[] = "Copyright (c) 1995-2002 SRI International.  All Rights Reserved.";
! static char RcsId[] = "@(#)$Header: /home/srilm/devel/lm/src/RCS/Discount.cc,v 1.18 2003/08/03 18:52:54 stolcke Exp $";
  #endif

  #include "Discount.h"

  #include "Array.cc"
--- 5,19 ----
   */

  #ifndef lint
! static char Copyright[] = "Copyright (c) 1995-2003 SRI International.  All Rights Reserved.";
! static char RcsId[] = "@(#)$Header: /home/srilm/devel/lm/src/RCS/Discount.cc,v 1.19 2003/11/11 19:35:20 stolcke Exp $";
  #endif

+ #include <math.h>
+ #if defined(sun) || defined(sgi)
+ #include <ieeefp.h>
+ #endif
+ 
  #include "Discount.h"

  #include "Array.cc"
***************
*** 193,199 ****
  		double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
  					    (i * (double)countOfCounts[i]);
  		coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
! 		if (coeff <= Prob_Epsilon || coeff0 > 1.0) {
  		    cerr << "warning: discount coeff " << i
  			 << " is out of range: " << coeff << "\n";
  		    coeff = 1.0;
--- 198,204 ----
  		double coeff0 = (i + 1) * (double)countOfCounts[i+1] /
  					    (i * (double)countOfCounts[i]);
  		coeff = (coeff0 - commonTerm) / (1.0 - commonTerm);
! 		if (!finite(coeff) || coeff <= Prob_Epsilon || coeff0 > 1.0) {
  		    cerr << "warning: discount coeff " << i
  			 << " is out of range: " << coeff << "\n";
  		    coeff = 1.0;
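
Why the added finite() test matters can be sketched in a few lines of C++.
This is a minimal illustration, not SRILM code: the function names and the
value of kProbEpsilon are hypothetical, and std::isfinite from <cmath> stands
in for the finite() call used in the patch. The assumption is that the nan
comes from a zero count-of-counts (0/0, or an infinite commonTerm), which the
original post does not confirm. Under IEEE arithmetic every ordered comparison
with NaN is false, so the old range check never fires and the NaN coefficient
is used as is.

```cpp
#include <cmath>

// Hypothetical stand-in for SRILM's Prob_Epsilon; the real value is not
// shown in the patch, so this number is an assumption.
const double kProbEpsilon = 3e-06;

// Same arithmetic as the first context lines of the hunk, with the
// count-of-counts n_i and n_{i+1} passed in as doubles.
double rawCoeff(unsigned i, double nI, double nNext) {
    return (i + 1) * nNext / (i * nI);
}

// Same arithmetic as the "coeff = ..." context line of the hunk.
double adjustedCoeff(double coeff0, double commonTerm) {
    return (coeff0 - commonTerm) / (1.0 - commonTerm);
}

// Range check as written before the patch: both comparisons are false
// when coeff is NaN, so a NaN coefficient is accepted.
bool outOfRangeOld(double coeff, double coeff0) {
    return coeff <= kProbEpsilon || coeff0 > 1.0;
}

// Range check with the extra finiteness test, mirroring the patched line
// (std::isfinite is used here in place of finite()).
bool outOfRangeNew(double coeff, double coeff0) {
    return !std::isfinite(coeff) || coeff <= kProbEpsilon || coeff0 > 1.0;
}
```

With n_i = n_{i+1} = 0, rawCoeff(1, 0.0, 0.0) is NaN; outOfRangeOld reports
the resulting coefficient as in range, while outOfRangeNew flags it, so the
code falls back to coeff = 1.0 as it does for any other out-of-range value.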