[SRILM User List] nan in language model

Mon Mar 12 09:33:15 PDT 2012

On 3/12/2012 6:10 AM, Rico Sennrich wrote:
> Hi list,
>
> Occasionally, I get 'nan' as probability or backoff weight in LMs
> trained with SRILM. This is not expected in an ARPA file and eventually
> leads to crashes / undefined behaviour in other programs that use the
> model.
It's certainly not supposed to happen.
In your case it looks like 5-grams end up with nan probabilities, which 
would then lead to BOWs also being computed as NaNs.

I have never seen this, actually.  It would help to try a few things:

- see if it only happens with -kndiscount.
- try to reproduce the problem with a smaller amount of input data (e.g., 
including only the ngrams that have NaNs as probabilities)
- see if those ngram counts have any special properties.
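
To collect the affected ngrams for the checks above, a small script along these lines might help. It is only a sketch and not part of SRILM: the function names (is_nan, scan_arpa) are made up for illustration, and it assumes a standard ARPA text file in which each ngram line holds the log probability, then the N words, then an optional backoff weight. Only the first and last numeric fields are tested, so a vocabulary word that happens to be spelled "nan" is not miscounted.

```python
import gzip
import re
import sys
from collections import Counter

# Matches ARPA section headers such as "\3-grams:"
SECTION = re.compile(r"^\\(\d+)-grams:\s*$")


def is_nan(token):
    """True if token parses as a floating-point NaN (e.g. 'nan', '-nan')."""
    try:
        value = float(token)
    except ValueError:
        return False
    return value != value  # NaN is the only float unequal to itself


def scan_arpa(lines):
    """Count NaN probabilities and backoff weights per ngram order.

    Returns (nan_prob, nan_bow, bad), where bad is a list of
    (order, ngram) pairs for every entry containing a NaN.
    """
    nan_prob, nan_bow, bad = Counter(), Counter(), []
    order = 0  # 0 means "not inside an N-grams section"
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = SECTION.match(line)
        if match:
            order = int(match.group(1))
            continue
        if line.startswith("\\"):  # \data\ or \end\
            order = 0
            continue
        if order == 0:
            continue  # "ngram N=count" header lines
        fields = line.split()
        if len(fields) < order + 1:
            continue  # malformed line, skipped in this sketch
        prob_nan = is_nan(fields[0])
        bow_nan = len(fields) > order + 1 and is_nan(fields[order + 1])
        nan_prob[order] += prob_nan
        nan_bow[order] += bow_nan
        if prob_nan or bow_nan:
            bad.append((order, " ".join(fields[1:order + 1])))
    return nan_prob, nan_bow, bad


if __name__ == "__main__" and len(sys.argv) > 1:
    path = sys.argv[1]
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        nan_prob, nan_bow, bad = scan_arpa(handle)
    for n in sorted(set(nan_prob) | set(nan_bow)):
        print(f"order {n}: prob nan={nan_prob[n]} backoff nan={nan_bow[n]}",
              file=sys.stderr)
    for n, ngram in bad:
        print(f"{n}\t{ngram}")
```

Run with the LM file as the only argument, it prints the per-order counts to stderr and the affected ngrams to stdout. The list of affected ngrams is kept in memory, which may need rethinking for very large models.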

Andreas

>
> Here are some statistics:
>
> \data\
> ngram 1=2054819
> ngram 2=40441708
> ngram 3=187680929
> ngram 4=382878635
> ngram 5=519867931
>
> probability nan:
> 1 0
> 2 0
> 3 0
> 4 0
> 5 1233183
>
> backoff nan:
> 1 0
> 2 0
> 3 0
> 4 415865
> 5 0
>
>
> Here are the training parameters:
>
> make-batch-counts file-list.txt 10 cat /wrk/smt/tmp -order 5
>
> make-big-lm -kndiscount -interpolate -order 5 -read \
> tmp/file-list.txt-1.ngrams.gz -unk -lm hugelm.gz
>
> This happened with SRILM 1.5.9 and 1.6.0-beta, and stderr didn't show
> any errors/warnings.
>
> best wishes,
> Rico