[SRILM User List] Interpreting ngram -ppl output in case of backoff

Sander Maijers S.N.Maijers at student.ru.nl
Sat Jun 1 06:00:28 PDT 2013


Andreas, you wrote:
 > If the word is 'a' and the last two words 'b' and 'c' (in that
 > order), and you have a bigram hit (output says '[2gram]'), you'd have
 > to look up the bigram log probability for 'a c' and add to that the
 > backoff weight for 'b c'.
I read this as P(a | b c). If that is correct, shouldn't I look up the 
ARPA LM line for "c a" (the reverse order) rather than "a c"?
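
For reference, the lookup order I have in mind can be sketched as follows. This is a minimal illustration of standard ARPA-style back-off, not SRILM's actual code; the function name backoff_logprob, the dictionary layout, and the toy numbers are all assumptions made for the example. N-grams are keyed in natural left-to-right order, so P(a | b c) falls back from the trigram "b c a" to the bigram "c a" plus the back-off weight of "b c".

```python
# Hypothetical toy model: maps an n-gram (tuple of words, oldest word first)
# to (log10 probability, log10 back-off weight). Values are made up.
TOY_LM = {
    ("a",): (-1.0, -0.3),
    ("c",): (-1.2, -0.4),
    ("c", "a"): (-0.5, -0.2),
    ("b", "c"): (-0.7, -0.1),
}


def backoff_logprob(lm, word, history):
    """Return (log10 P(word | history), order of the n-gram that was hit).

    `history` lists the preceding words, oldest first.
    """
    ngram = tuple(history) + (word,)
    if ngram in lm:
        return lm[ngram][0], len(ngram)
    if not history:
        raise KeyError(f"{word!r} has no unigram entry")
    # Back off: add the weight of the full history (0.0 if it has no entry),
    # then retry with the oldest history word dropped.
    bow = lm[tuple(history)][1] if tuple(history) in lm else 0.0
    logprob, order = backoff_logprob(lm, word, history[1:])
    return bow + logprob, order


logprob, order = backoff_logprob(TOY_LM, "a", ["b", "c"])
print(f"[{order}gram] {logprob:.4f}")
```

In this toy model there is no trigram "b c a", so the call reports a bigram hit whose value is the logprob of "c a" plus the back-off weight of "b c".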

Could you please comment further to clear up my remaining confusion?

1. The word "Albert" is not in the word list/vocabulary, nor are there 
any N-grams containing "Albert". This confuses me: I cannot trace the 
N-gram that led to the logprob reported by 'ngram'. It seems that I 
cannot see directly in the 'ngram' output whether or when backing off 
occurred during the LM lookups for this sentence. I assume that the 
word "Albert" was actually replaced with [unk], but the 'ngram' output 
does not show this. Also, 0 OOVs are reported, which strikes me as odd. 
All in all, I believe that the following N-grams were looked up:

-1.358477	<s> dat	-0.6622628
-1.334724	dat [unk]	-0.0686222
-0.6776069	dat [unk] [unk]

for the 'ngram' output

dat Albert Heijn het doet zou niet de aanleiding zijn	
    p( dat | <s> ) = 0.0438046 [ -1.35848 ]
    p( Albert | dat ...) = 0.0100695 [ -1.99699 ]
    p( Heijn | Albert ...) = 0.210084 [ -0.677607 ]

How did 'ngram' arrive at the logprob of -1.99699?
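
A quick arithmetic check, using only the three ARPA entries quoted above, suggests one possible reading: the reported -1.99699 matches the bigram logprob of "dat [unk]" plus the back-off weight of "<s> dat", which is what backing off from a missing trigram "<s> dat [unk]" would produce. That the trigram is absent is an assumption on my part, not something the output confirms, and the variable names below are only illustrative.

```python
# ARPA entries quoted above: (log10 prob, n-gram, log10 back-off weight or None)
entries = [
    (-1.358477, "<s> dat", -0.6622628),
    (-1.334724, "dat [unk]", -0.0686222),
    (-0.6776069, "dat [unk] [unk]", None),
]
lm = {ngram: (logprob, bow) for logprob, ngram, bow in entries}

# Assumed scenario: no trigram "<s> dat [unk]" exists, so the model backs off
# to the bigram "dat [unk]" and adds the back-off weight of "<s> dat".
albert = lm["<s> dat"][1] + lm["dat [unk]"][0]
# The trigram "dat [unk] [unk]" would be a direct hit, with no back-off.
heijn = lm["dat [unk] [unk]"][0]

print(f"Albert: {albert:.5f} (p = {10 ** albert:.7f})")
print(f"Heijn:  {heijn:.6f} (p = {10 ** heijn:.6f})")
```

Both sums agree with the 'ngram' output to the precision shown there, which would also be consistent with "Albert" and "Heijn" having been mapped to [unk].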


Extra suggestions:
A. Just now I saw the description of the 'ngram' option '-limit-vocab': 
'The default is that words used in the LM are automatically added to 
the vocabulary.' I would say that not doing this, and restricting the 
known words to those in the vocabulary, is a more sensible default, 
because the current behavior defeats an important reason for specifying 
a vocabulary (controlling the lookup in the LM). In any case, could you 
mention this behavior in the description of '-vocab' as well?

B. I think it would be useful if each p( ... ) line in the 'ngram' 
output listed something like the line number in the ARPA LM of the 
N-gram that was retrieved, if need be only at the '-debug 4' level. It 
could be a line number or some other unambiguous key/index to the 
N-grams, as long as it can be parsed automatically from the output.
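
To illustrate what "automatically parseable" could mean in practice, here is a minimal Python sketch that parses the p( ... ) lines in the form quoted earlier in this message. The regular expression and the function name parse_p_line are my own assumptions, not part of SRILM, and the exact output format may differ between versions and debug levels. A key such as an ARPA line number would simply become one more field.

```python
import re

# Matches lines such as "p( Albert | dat ...) = 0.0100695 [ -1.99699 ]".
# The "[2gram]"-style tag after "=" is treated as optional, since the
# output quoted in this message does not show it.
P_LINE = re.compile(
    r"p\(\s*(?P<word>\S+)\s*\|\s*(?P<context>.*?)\s*\)\s*=\s*"
    r"(?:\[(?P<hit>\w+)\]\s*)?(?P<prob>\S+)\s*\[\s*(?P<logprob>\S+)\s*\]"
)


def parse_p_line(line):
    """Return a dict for one p( ... ) line, or None if the line does not match."""
    m = P_LINE.search(line)
    if m is None:
        return None
    return {
        "word": m["word"],
        "context": m["context"],
        "hit": m["hit"],  # e.g. "2gram", or None when no tag is present
        "prob": float(m["prob"]),
        "logprob": float(m["logprob"]),
    }


sample = """\
dat Albert Heijn het doet zou niet de aanleiding zijn
    p( dat | <s> ) = 0.0438046 [ -1.35848 ]
    p( Albert | dat ...) = 0.0100695 [ -1.99699 ]
    p( Heijn | Albert ...) = 0.210084 [ -0.677607 ]
"""
for row in filter(None, map(parse_p_line, sample.splitlines())):
    print(row)
```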







