[SRILM User List] Interpreting ngram -ppl output in case of backoff

Thu May 30 11:14:54 PDT 2013

On 5/30/2013 8:38 AM, Sander Maijers wrote:
> Hi,
>
> I have trained a baseline N-gram LM like so:
> vocab %s -unk -map-unk '[unk]' -prune %s -debug 1 -order 3 -text %s 
> -sort -lm %s
>
> Suppose I have the following line from ngram -ppl -debug 3 -map-unk 
> [unk] ... :
> ( Heijn | Albert ...) = 0.210084 [ -0.677607 ]
>
> This bigram is not in my LM. My pronunciation lexicon contains both 
> words, but only in lower case. I believe that the bigram that would be 
> looked up in this case by ngram is the one for "[unk] [unk]":
>
> -0.5549474    [unk] [unk]    -0.2222121
>
> I do not understand precisely how to confirm this with the logprob 
> between brackets reported by ngram. When the applicable N-gram *is* in 
> the LM, the logprobs do not match between the ARPA line and the ngram 
> output either, but this must be due to discounting applied by default. 
> The man page for ngram with arguments -debug 2 -ppl says:
> "Probabilities for each word, plus LM-dependent details about backoff 
> used etc., are printed.".
>
> Where should I look for the backoff details in my ngram output to 
> assess the role of backoff, including the backing off that happens in 
> LMs generated with the -skip option?

You won't see all the details of the backoff computation in the ppl output.
If the word is 'a' and the preceding two words are 'b' and 'c' (in that 
order), and you have a bigram hit (the output says '[2gram]'), you'd have 
to look up the bigram log probability of 'a' given 'c' (the ARPA entry 
'c a') and add to it the backoff weight for 'b c'. Unfortunately, only one 
word of history is printed (to keep things brief), so for trigram and 
higher-order models you need to extract the history from the complete 
sentence string.

Andreas
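
The lookup described above can be sketched in a few lines of Python. This is
only an illustration, not SRILM code: the table ARPA, its entries and all of
its numbers are made up, and backoff_logprob is a hypothetical helper that
assumes the usual ARPA convention (log10 values, and a backoff weight of
log10 1 = 0 when the context has no entry). The last lines check that the
bracketed number in the quoted -ppl line is the log10 of the probability
printed before it.

```python
import math

# Hypothetical toy entries in ARPA order (history first, predicted word last):
# n-gram tuple -> (log10 probability, log10 backoff weight). Numbers are made up.
ARPA = {
    ("a",): (-1.2, -0.25),
    ("b",): (-1.1, -0.20),
    ("c",): (-1.0, -0.30),
    ("c", "a"): (-0.5, -0.10),   # bigram "c a", i.e. p(a | c)
    ("b", "c"): (-0.7, -0.20),   # its backoff weight is used when "b c a" is missing
}

def backoff_logprob(word, history, lm):
    """Return log10 p(word | history) using standard backoff."""
    ngram = tuple(history) + (word,)
    if ngram in lm:
        return lm[ngram][0]
    if not history:
        # Word is not even a unigram in this toy table.
        return float("-inf")
    # Assumption: a context without an entry has backoff weight 1 (log10 = 0).
    bow = lm.get(tuple(history), (0.0, 0.0))[1]
    # Drop the oldest history word and retry with the shorter context.
    return bow + backoff_logprob(word, tuple(history)[1:], lm)

# Trigram "b c a" is absent, so: backoff weight of "b c" + log prob of "c a".
logprob = backoff_logprob("a", ("b", "c"), ARPA)   # -0.2 + -0.5
prob = 10 ** logprob                               # value printed before the brackets

# Sanity check on the line quoted in the question: 0.210084 [ -0.677607 ]
quoted_logprob = math.log10(0.210084)
```

To check a real '[2gram]' line this way, the two numbers would have to be
copied from the 'c a' and 'b c' lines of the actual ARPA file, with 'b' taken
from the full sentence since the ppl output shows only one word of history.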