[SRILM User List] nan in language model
Andreas Stolcke
stolcke at icsi.berkeley.edu
Fri May 18 22:29:42 PDT 2012
Attached is a patch that catches negative discounts when using make-big-lm.
The discount estimator built into ngram-count (Discount.cc) already had
this check, but for some reason it was not in the make-kn-discounts script.
Andreas
On 5/15/2012 3:19 AM, Rico Sennrich wrote:
> On Mon, 2012-03-12 at 09:33 -0700, Andreas Stolcke wrote:
>> On 3/12/2012 6:10 AM, Rico Sennrich wrote:
>>> Hi list,
>>>
>>> Occasionally, I get 'nan' as probability or backoff weight in LMs
>>> trained with SRILM. This is not expected in an ARPA file and eventually
>>> leads to crashes / undefined behaviour in other programs that use the
>>> model.
>> It's certainly not supposed to happen.
>> In your case it looks like 5-grams end up with nan probabilities, which
>> would then lead to BOWs also being computed as NaNs.
>>
>> I have never seen this, actually. It would help to try a few things:
> Sorry for the late reply. The short answer is that a negative kndiscount
> (discount3+ in biglm.kn5) is the problem. I guess it's a known problem
> that Kneser-Ney smoothing behaves weirdly for data with lots of
> duplicates, but I'd rather have an error message than for SRILM to
> silently build a corrupt LM.
>
>> - see if it only happens with -kndiscount.
> with -kndiscount -interpolate I get NaNs (as described before)
> with -kndiscount and without -interpolate, the last step (ngram-count)
> crashes.
> with default smoothing (no smoothing option specified), training seems
> to hang up at some point.
>
> There are no errors or warnings in any of these cases.
>
>> - see if those ngram counts have any special properties.
> The corpus the models were trained on is the News Crawl corpus
> http://www.statmt.org/wmt11/translation-task.html , and there are quite
> a few duplicate sentences in this corpus (which explains the negative
> kndiscount). The affected ngrams seem to all stem from these duplicate
> sentences.
>
> Rico
>
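The NaN entries described in the quoted report can be found before a model is handed to other programs. The following is a minimal Python sketch, not part of SRILM: the function names find_nan_entries and _is_nan are hypothetical, and it assumes the usual ARPA layout in which each n-gram line holds a log probability, the N words of the n-gram, and an optional backoff weight, under a "\N-grams:" section header.

```python
import math
import re

# Matches ARPA section headers such as "\1-grams:" or "\5-grams:".
SECTION = re.compile(r"^\\(\d+)-grams:\s*$")


def _is_nan(text):
    """Return True if text parses as a floating-point NaN."""
    try:
        return math.isnan(float(text))
    except ValueError:
        return False


def find_nan_entries(lines):
    """Return (line_number, line) pairs whose log prob or backoff weight is NaN.

    Assumption: whitespace-separated fields, i.e. words contain no spaces.
    """
    order = 0  # 0 until the first "\N-grams:" header has been seen
    bad = []
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        match = SECTION.match(line)
        if match:
            order = int(match.group(1))
            continue
        # Skip the header block, blank lines, and markers such as "\end\".
        if order == 0 or not line or line.startswith("\\"):
            continue
        fields = line.split()
        values = [fields[0]]  # the log probability is always first
        if len(fields) == order + 2:
            values.append(fields[-1])  # a backoff weight is present
        if any(_is_nan(v) for v in values):
            bad.append((lineno, line))
    return bad
```

Counting fields against the current n-gram order keeps a word that is literally spelled "nan" from being reported as a bad backoff weight.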
-------------- next part --------------
Index: utils/src/make-kn-discounts.gawk
===================================================================
RCS file: /home/srilm/CVS/srilm/utils/src/make-kn-discounts.gawk,v
retrieving revision 1.4
diff -c -r1.4 make-kn-discounts.gawk
*** utils/src/make-kn-discounts.gawk 17 Jun 2007 01:21:18 -0000 1.4
--- utils/src/make-kn-discounts.gawk 19 May 2012 05:10:19 -0000
***************
*** 95,102 ****
Y = countOfCounts[1]/(countOfCounts[1] + 2 * countOfCounts[2]);
print "mincount", min;
! print "discount1", 1 - 2 * Y * countOfCounts[2] / countOfCounts[1];
! print "discount2", 2 - 3 * Y * countOfCounts[3] / countOfCounts[2];
! print "discount3+", 3 - 4 * Y * countOfCounts[4] / countOfCounts[3];
}
--- 95,114 ----
Y = countOfCounts[1]/(countOfCounts[1] + 2 * countOfCounts[2]);
+ discount1 = 1 - 2 * Y * countOfCounts[2] / countOfCounts[1];
+ discount2 = 2 - 3 * Y * countOfCounts[3] / countOfCounts[2];
+ discount3plus = 3 - 4 * Y * countOfCounts[4] / countOfCounts[3];
+
print "mincount", min;
! print "discount1", discount1;
! print "discount2", discount2;
! print "discount3+", discount3plus;
!
! # check for invalid values after output, so we see where the problem is
! if (discount1 < 0 || discount2 < 0 || discount3plus < 0) {
! printf "error: one of modified KneserNey discounts is negative\n" \
! >> "/dev/stderr";
! exit(2);
! }
!
}
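
For readers who want to try the check outside gawk, here is a minimal Python transliteration of the formulas in the patch above. It is a sketch, not the script itself: the function name modified_kn_discounts is hypothetical, it raises an exception instead of printing the discounts and exiting with status 2, and the guard against zero counts is an added assumption to avoid division by zero.

```python
def modified_kn_discounts(count_of_counts):
    """Compute modified Kneser-Ney discounts from counts-of-counts.

    count_of_counts maps k to the number of n-grams seen exactly k times
    and must contain entries for k = 1, 2, 3, 4.
    """
    n1, n2, n3, n4 = (count_of_counts[k] for k in (1, 2, 3, 4))

    # Assumption (not in the patch): refuse counts that would divide by zero.
    if n1 == 0 or n2 == 0 or n3 == 0:
        raise ValueError("counts-of-counts for 1, 2 and 3 must be nonzero")

    # Same formulas as in make-kn-discounts.gawk.
    y = n1 / (n1 + 2 * n2)
    discounts = {
        "discount1": 1 - 2 * y * n2 / n1,
        "discount2": 2 - 3 * y * n3 / n2,
        "discount3+": 3 - 4 * y * n4 / n3,
    }

    # Same validity check as the patch: no discount may be negative.
    negative = [name for name, value in discounts.items() if value < 0]
    if negative:
        raise ValueError(
            "negative modified Kneser-Ney discount(s): " + ", ".join(negative)
        )
    return discounts
```

With counts-of-counts such as {1: 100, 2: 50, 3: 10, 4: 40}, where n-grams seen four times outnumber those seen three times (as can happen with many duplicate sentences), discount3+ comes out negative and the function raises.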