[SRILM User List] Interpreting ngram -ppl output in case of backoff
Sander Maijers
S.N.Maijers at student.ru.nl
Sat Jun 1 04:37:05 PDT 2013
On 30-05-13 20:14, Andreas Stolcke wrote:
> On 5/30/2013 8:38 AM, Sander Maijers wrote:
>> Hi,
>>
>> I have trained a baseline N-gram LM like so:
>> vocab %s -unk -map-unk '[unk]' -prune %s -debug 1 -order 3 -text %s
>> -sort -lm %s
>>
>> Suppose I get the following line from ngram -ppl -debug 3 -map-unk
>> [unk] ... :
>> ( Heijn | Albert ...) = 0.210084 [ -0.677607 ]
>>
>> This bigram is not in my LM. My pronunciation lexicon contains both
>> words, but only in lower case. I believe that the bigram that would be
>> looked up in this case by ngram is the one for "[unk] [unk]":
>>
>> -0.5549474 [unk] [unk] -0.2222121
>>
>> I do not understand precisely how to confirm this with the logprob
>> between brackets reported by ngram. When the applicable N-gram *is* in
>> the LM, the logprobs do not match between the ARPA line and the ngram
>> output either, but this must be due to discounting applied by default.
>> The man page for ngram with arguments -debug 2 -ppl says:
>> "Probabilities for each word, plus LM-dependent details about backoff
>> used etc., are printed.".
>>
>> Where should I look for the backoff details in my ngram output to
>> assess the role of backoff, including the backing off that happens in
>> LMs generated with the -skip option?
>
> You won't see all the details of the backoff computation in the ppl output.
> If the word is 'a' and the last two words 'b' and 'c' (in that order),
> and you have a bigram hit (output says '[2gram]' ), you'd have to look
> up the bigram log probability for 'a c' and add to that the backoff
> weight for 'b c'. Unfortunately only one word of history is printed
> (to keep things brief), so for trigrams and higher models you need to
> extract the history from the complete sentence string.
>
> Andreas
>
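The lookup rule described above can be sketched in a few lines of Python. The tables and numbers below are invented toy values for illustration only (they are not entries from the LM under discussion); the point is just the arithmetic: when an N-gram is missing, add the context's backoff weight and retry with a shortened history.

```python
# Toy ARPA-style tables: log10 probabilities and backoff weights.
# All values here are made up for illustration.
logprob = {
    ("c", "a"): -0.9,   # bigram  log10 P(a | c)
    ("a",):     -1.5,   # unigram log10 P(a)
}
bow = {
    ("b", "c"): -0.3,   # backoff weight of trigram context "b c"
    ("c",):     -0.1,   # backoff weight of bigram context "c"
}

def log10_p(word, history):
    """Backoff lookup of log10 P(word | history)."""
    ngram = tuple(history) + (word,)
    if ngram in logprob:
        return logprob[ngram]
    if not history:
        return float("-inf")  # word is not even a unigram
    # Missing N-gram: add the context's backoff weight (0 if the
    # context itself is unseen) and retry with a shortened history.
    return bow.get(tuple(history), 0.0) + log10_p(word, history[1:])

# Trigram "b c a" is absent, so the model falls back to the bigram:
# log10 P(a | b c) = BOW(b c) + log10 P(a | c) = -0.3 + -0.9 = -1.2
print(log10_p("a", ("b", "c")))
```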
Thank you. Could you explain the following example? How do I interpret
this snippet of ngram output:
dat Albert Heijn het doet zou niet de aanleiding zijn
p( dat | <s> ) = 0.0438046 [ -1.35848 ]
p( Albert | dat ...) = 0.0100695 [ -1.99699 ]
What is the order of the first two N-grams retrieved here? There is no
line with [2gram] in the output, yet the first N-gram has no ellipsis in
its history while the second line does.
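Whatever the N-gram order turns out to be, the two numbers on each -ppl line can be sanity-checked against each other: the value in square brackets is the base-10 logarithm of the probability printed before it. A quick check on the two lines quoted above:

```python
# Each `p( w | h ) = P [ L ]` line satisfies L = log10(P);
# verify that for the two output lines quoted above.
lines = [
    (0.0438046, -1.35848),   # p( dat | <s> )
    (0.0100695, -1.99699),   # p( Albert | dat ...)
]
for prob, log10prob in lines:
    assert abs(10 ** log10prob - prob) < 1e-5
print("probabilities and bracketed log10 values agree")
```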