[SRILM User List] Interpreting ngram -ppl output in case of backoff

Thu May 30 08:38:24 PDT 2013

Hi,

I have trained a baseline N-gram LM like so:
vocab %s -unk -map-unk '[unk]' -prune %s -debug 1 -order 3 -text %s 
-sort -lm %s

Suppose I have the following line in the output of
ngram -ppl -debug 3 -map-unk [unk] ... :
( Heijn | Albert ...) = 0.210084 [ -0.677607 ]
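
As a sanity check on how to read that line, the bracketed value appears to be the base-10 logarithm of the probability printed before it. A minimal Python sketch (not part of SRILM; the helper name `matches` is made up for illustration) that verifies this for the two numbers above:

```python
import math

# Values copied from the ngram -ppl -debug 3 line above.
prob = 0.210084
log10_prob = -0.677607


def matches(prob, log10_prob, tol=1e-5):
    """Return True if log10_prob equals log10(prob) up to print rounding."""
    return math.isclose(math.log10(prob), log10_prob, abs_tol=tol)


print(matches(prob, log10_prob))
```

The tolerance only absorbs the rounding of the printed six-digit values.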

This bigram is not in my LM. My pronunciation lexicon contains both 
words, but only in lower case. I believe that the bigram that would be 
looked up in this case by ngram is the one for "[unk] [unk]":

-0.5549474	[unk] [unk]	-0.2222121
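
For reference, an ARPA line holds the log10 probability, the n-gram itself, and an optional log10 backoff weight that applies when this n-gram is the history of a longer one. The sketch below is a minimal Python illustration of that layout and of the standard backoff rule, reduced to the bigram case for brevity (my LM is order 3). It assumes tab-separated fields; the function names and the toy unigram and bigram values are hypothetical and are not taken from my LM or from SRILM's code.

```python
def parse_arpa_line(line):
    """Split one ARPA n-gram line into (log10 prob, words, log10 backoff weight)."""
    fields = line.rstrip("\n").split("\t")  # assumes tab-separated fields
    logprob = float(fields[0])
    words = tuple(fields[1].split())
    # The backoff weight is optional; treat a missing one as log10(1) = 0.
    bow = float(fields[2]) if len(fields) > 2 and fields[2].strip() else 0.0
    return logprob, words, bow


def bigram_logprob(unigrams, bigrams, history, word):
    """log10 p(word | history) under the standard ARPA backoff rule.

    unigrams maps word -> (logprob, bow); bigrams maps (history, word) -> logprob.
    """
    if (history, word) in bigrams:
        return bigrams[(history, word)]
    # Bigram not listed: back off to bow(history) + log10 p(word).
    hist_bow = unigrams.get(history, (0.0, 0.0))[1]
    return hist_bow + unigrams[word][0]


# The line quoted above, with tabs between the three fields.
print(parse_arpa_line("-0.5549474\t[unk] [unk]\t-0.2222121"))

# Toy values for illustration only, NOT from my LM.
unigrams = {"a": (-1.0, -0.5), "b": (-2.0, 0.0)}
bigrams = {("a", "a"): -0.3}
print(bigram_logprob(unigrams, bigrams, "a", "a"))  # listed bigram, used directly
print(bigram_logprob(unigrams, bigrams, "a", "b"))  # backed off: bow(a) + logp(b)
```

Under this reading, -0.5549474 would be log10 p([unk] | [unk]) and -0.2222121 the weight applied when backing off from a trigram whose history is "[unk] [unk]".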

I do not understand precisely how to confirm this from the bracketed 
logprob reported by ngram. Even when the applicable N-gram *is* in 
the LM, the logprob on the ARPA line does not match the one in the ngram 
output, but I assume this is due to discounting applied by default. 
The man page for ngram with arguments -debug 2 -ppl says:
"Probabilities for each word, plus LM-dependent details about backoff 
used etc., are printed.".

Where should I look for the backoff details in my ngram output to assess 
the role of backoff, including the backoff that happens in LMs 
generated with the -skip option?

Best,
Sander