[SRILM User List] nan in language model

Fri May 18 22:29:42 PDT 2012

Attached is a patch that catches negative discounts when using make-big-lm.
The discount estimator built into ngram-count (Discount.cc) already had 
this check, but for some reason it was not in the make-kn-discounts script.
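
As a rough illustration of what the check guards against, here is a small
Python sketch (not part of the patch) that mirrors the discount formulas in
make-kn-discounts.gawk. The function names and the count-of-count values are
made up for illustration; only the formulas come from the script. With many
duplicated sentences the number of n-grams seen exactly four times can be
large relative to those seen three times, which can push discount3+ below zero.

```python
def kn_discounts(n1, n2, n3, n4):
    """Modified Kneser-Ney discounts from counts-of-counts.

    n1..n4 are the numbers of n-grams seen exactly 1..4 times.
    The formulas follow make-kn-discounts.gawk.
    """
    y = n1 / (n1 + 2 * n2)
    return {
        "discount1": 1 - 2 * y * n2 / n1,
        "discount2": 2 - 3 * y * n3 / n2,
        "discount3+": 3 - 4 * y * n4 / n3,
    }


def negative_discounts(discounts):
    """Return the names of all discounts that are below zero."""
    return [name for name, value in discounts.items() if value < 0]


# Hypothetical counts-of-counts with a typical, decreasing shape.
typical = kn_discounts(1000, 400, 200, 120)

# Hypothetical counts-of-counts where duplicates inflate the count-4 bucket.
skewed = kn_discounts(1000, 400, 100, 600)

for label, discounts in (("typical", typical), ("skewed", skewed)):
    bad = negative_discounts(discounts)
    print(label, discounts, "negative:", bad)
```

In the skewed case only discount3+ goes negative, which matches the symptom
reported below for biglm.kn5.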

Andreas

On 5/15/2012 3:19 AM, Rico Sennrich wrote:
> On Mon, 2012-03-12 at 09:33 -0700, Andreas Stolcke wrote:
>> On 3/12/2012 6:10 AM, Rico Sennrich wrote:
>>> Hi list,
>>>
>>> Occasionally, I get 'nan' as probability or backoff weight in LMs
>>> trained with SRILM. This is not expected in an ARPA file and eventually
>>> leads to crashes / undefined behaviour in other programs that use the
>>> model.
>> It's certainly not supposed to happen.
>> In your case it looks like 5-grams end up with nan probabilities, which
>> would then lead to BOWs also being computed as NaNs.
>>
>> I have never seen this, actually.  It would help to try a few things:
> Sorry for the late reply. The short answer is that a negative kndiscount
> (discount3+ in biglm.kn5) is the problem. I guess it's a known problem
> that Kneser-Ney smoothing behaves weirdly for data with lots of
> duplicates, but I'd rather have an error message than for SRILM to
> silently build a corrupt LM.
>
>> - see if it only happens with -kndiscount.
> with -kndiscount -interpolate I get NaNs (as described before)
> with -kndiscount and without -interpolate, the last step (ngram-count)
> crashes.
> with default smoothing (no smoothing option specified), training seems
> to hang up at some point.
>
> There are no errors or warnings in any of these cases.
>
>> - see if those ngram counts have any special properties.
> The corpus the models were trained on is the News Crawl corpus
> http://www.statmt.org/wmt11/translation-task.html , and there are quite
> a few duplicate sentences in this corpus (which explains the negative
> kndiscount). The affected ngrams seem to all stem from these duplicate
> sentences.
>
> Rico
>

-------------- next part --------------
Index: utils/src/make-kn-discounts.gawk
===================================================================
RCS file: /home/srilm/CVS/srilm/utils/src/make-kn-discounts.gawk,v
retrieving revision 1.4
diff -c -r1.4 make-kn-discounts.gawk
*** utils/src/make-kn-discounts.gawk	17 Jun 2007 01:21:18 -0000	1.4
--- utils/src/make-kn-discounts.gawk	19 May 2012 05:10:19 -0000
***************
*** 95,102 ****

      Y = countOfCounts[1]/(countOfCounts[1] + 2 * countOfCounts[2]);

      print "mincount", min;
!     print "discount1", 1 - 2 * Y * countOfCounts[2] / countOfCounts[1];
!     print "discount2", 2 - 3 * Y * countOfCounts[3] / countOfCounts[2];
!     print "discount3+", 3 - 4 * Y * countOfCounts[4] / countOfCounts[3];
  }
--- 95,114 ----

      Y = countOfCounts[1]/(countOfCounts[1] + 2 * countOfCounts[2]);

+     discount1 = 1 - 2 * Y * countOfCounts[2] / countOfCounts[1];
+     discount2 = 2 - 3 * Y * countOfCounts[3] / countOfCounts[2];
+     discount3plus = 3 - 4 * Y * countOfCounts[4] / countOfCounts[3];
+ 
      print "mincount", min;
!     print "discount1", discount1;
!     print "discount2", discount2;
!     print "discount3+", discount3plus;
! 
!     # check for invalid values after output, so we see where the problem is 
!     if (discount1 < 0 || discount2 < 0 || discount3plus < 0) {
! 	printf "error: one of modified KneserNey discounts is negative\n" \
! 	       						>> "/dev/stderr";
! 	exit(2);
!     }
! 
  }