[SRILM User List] Interpreting ngram -ppl output in case of backoff
Sander Maijers
S.N.Maijers at student.ru.nl
Sat Jun 1 04:37:05 PDT 2013
On 30-05-13 20:14, Andreas Stolcke wrote:
> On 5/30/2013 8:38 AM, Sander Maijers wrote:
>> Hi,
>>
>> I have trained a baseline N-gram LM like so:
>> vocab %s -unk -map-unk '[unk]' -prune %s -debug 1 -order 3 -text %s
>> -sort -lm %s
>>
>> Suppose I get the following line from ngram -ppl -debug 3 -map-unk
>> [unk] ... :
>> ( Heijn | Albert ...) = 0.210084 [ -0.677607 ]
>>
>> This bigram is not in my LM. My pronunciation lexicon contains both
>> words, but only in lower case. I believe that the bigram that would be
>> looked up in this case by ngram is the one for "[unk] [unk]":
>>
>> -0.5549474 [unk] [unk] -0.2222121
>>
>> I do not understand precisely how to confirm this with the logprob
>> between brackets reported by ngram. When the applicable N-gram *is* in
>> the LM, the logprobs do not match between the ARPA line and the ngram
>> output either, but this must be due to discounting applied by default.
>> The man page for ngram with arguments -debug 2 -ppl says:
>> "Probabilities for each word, plus LM-dependent details about backoff
>> used etc., are printed.".
>>
>> Where should I look for the backoff details in my ngram output to
>> assess the role of backoff, including the backing off that happens in
>> LMs generated with the -skip option?
>
> You won't see all the details of the backoff computation in the ppl output.
> If the word is 'a' and the last two words 'b' and 'c' (in that order),
> and you have a bigram hit (output says '[2gram]' ), you'd have to look
> up the bigram log probability for 'a c' and add to that the backoff
> weight for 'b c'. Unfortunately only one word of history is printed
> (to keep things brief), so for trigrams and higher models you need to
> extract the history from the complete sentence string.
>
> Andreas
>
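The lookup rule described above can be sketched in a few lines of Python. The tables and numbers below are invented toy values for illustration only (they are not entries from the LM under discussion); the point is just the arithmetic: when an N-gram is missing, add the context's backoff weight and retry with a shortened history.

```python
# Toy ARPA-style tables: log10 probabilities and backoff weights.
# All values here are made up for illustration.
logprob = {
    ("c", "a"): -0.9,   # bigram  log10 P(a | c)
    ("a",):     -1.5,   # unigram log10 P(a)
}
bow = {
    ("b", "c"): -0.3,   # backoff weight of trigram context "b c"
    ("c",):     -0.1,   # backoff weight of bigram context "c"
}

def log10_p(word, history):
    """Backoff lookup of log10 P(word | history)."""
    ngram = tuple(history) + (word,)
    if ngram in logprob:
        return logprob[ngram]
    if not history:
        return float("-inf")  # word is not even a unigram
    # Missing N-gram: add the context's backoff weight (0 if the
    # context itself is unseen) and retry with a shortened history.
    return bow.get(tuple(history), 0.0) + log10_p(word, history[1:])

# Trigram "b c a" is absent, so the model falls back to the bigram:
# log10 P(a | b c) = BOW(b c) + log10 P(a | c) = -0.3 + -0.9 = -1.2
print(log10_p("a", ("b", "c")))
```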
Thank you. Could you explain the following example? How do I interpret
this snippet of ngram output:
dat Albert Heijn het doet zou niet de aanleiding zijn
p( dat | <s> ) = 0.0438046 [ -1.35848 ]
p( Albert | dat ...) = 0.0100695 [ -1.99699 ]
What is the order of the first two N-grams retrieved here? There is no
line with [2gram] in the output, yet the first N-gram has no ellipsis in
its history while the second line does.
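Whatever the N-gram order turns out to be, the two numbers on each -ppl line can be sanity-checked against each other: the value in square brackets is the base-10 logarithm of the probability printed before it. A quick check on the two lines quoted above:

```python
# Each `p( w | h ) = P [ L ]` line satisfies L = log10(P);
# verify that for the two output lines quoted above.
lines = [
    (0.0438046, -1.35848),   # p( dat | <s> )
    (0.0100695, -1.99699),   # p( Albert | dat ...)
]
for prob, log10prob in lines:
    assert abs(10 ** log10prob - prob) < 1e-5
print("probabilities and bracketed log10 values agree")
```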