[SRILM User List] nan in language model

Rico Sennrich rico.sennrich at gmx.ch
Tue May 15 03:19:14 PDT 2012


On Mon, 2012-03-12 at 09:33 -0700, Andreas Stolcke wrote: 
> On 3/12/2012 6:10 AM, Rico Sennrich wrote:
> > Hi list,
> >
> > Occasionally, I get 'nan' as a probability or backoff weight in LMs
> > trained with SRILM. This is not expected in an ARPA file and eventually
> > leads to crashes / undefined behaviour in other programs that use the
> > model.
> It's certainly not supposed to happen.
> In your case it looks like 5-grams end up with nan probabilities, which 
> would then lead to BOWs also being computed as NaNs.
> 
> I have never seen this, actually. It would help to try a few things:

Sorry for the late reply. The short answer is that a negative Kneser-Ney
discount (discount3+ in biglm.kn5) is the problem. I gather it's a known
problem that Kneser-Ney smoothing behaves weirdly on data with lots of
duplicates, but I'd rather get an error message than have SRILM silently
build a corrupt LM.
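
For reference, here is a rough sketch (in Python, with made-up
counts-of-counts) of how I understand the modified Kneser-Ney discounts
are estimated (Chen & Goodman's formulas, which is what -kndiscount
implements, as far as I know), and how duplicated data can push
discount3+ below zero:

# Sketch of the modified Kneser-Ney discount estimates (Chen & Goodman).
# n1..n4 are counts-of-counts: how many distinct n-grams occur exactly
# 1, 2, 3 and 4 times.  All numbers below are made up for illustration.

def mod_kn_discounts(n1, n2, n3, n4):
    Y = n1 / float(n1 + 2 * n2)
    D1 = 1 - 2 * Y * n2 / float(n1)
    D2 = 2 - 3 * Y * n3 / float(n2)
    D3plus = 3 - 4 * Y * n4 / float(n3)
    return D1, D2, D3plus

# Typical text: counts-of-counts fall off smoothly, all discounts positive.
print(mod_kn_discounts(1000000, 350000, 180000, 110000))
# -> roughly (0.59, 1.09, 1.56)

# Text with many duplicated sentences: n-grams that would occur twice now
# occur four times, so n4 is inflated relative to n3 and
# D3+ = 3 - 4*Y*n4/n3 goes negative.
print(mod_kn_discounts(600000, 300000, 100000, 200000))
# -> roughly (0.5, 1.5, -1.0)

Presumably, with a negative discount the "discounted" probabilities of
seen n-grams can sum to more than 1, so the leftover mass for the backoff
weight goes negative and its log ends up as NaN, which would match what I
see in the ARPA file.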

> - see if it only happens with -kndiscount.

With -kndiscount -interpolate, I get NaNs (as described before).
With -kndiscount and without -interpolate, the last step (ngram-count)
crashes.
With default smoothing (no smoothing option specified), training seems
to hang at some point.

There are no errors or warnings in any of these cases.

> - see if those ngram counts have any special properties.

The models were trained on the News Crawl corpus
(http://www.statmt.org/wmt11/translation-task.html), and there are quite
a few duplicate sentences in it, which explains the negative discount.
The affected n-grams all seem to stem from these duplicate sentences.
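
A quick way to check the amount of duplication (the corpus file name
below is just an example, not the actual News Crawl file name):

# Rough duplicate-sentence check; reads the whole corpus into memory.
from collections import Counter

counts = Counter(line.rstrip("\n") for line in open("news.2011.en.shuffled"))
dup = [(s, c) for s, c in counts.items() if c > 1]
print(len(dup), "distinct sentences occur more than once")
print(sum(c for s, c in dup), "corpus lines belong to duplicated sentences")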

Rico 


