[SRILM User List] Interpreting ngram -ppl output in case of backoff
Sander Maijers
S.N.Maijers at student.ru.nl
Sat Jun 1 06:00:28 PDT 2013
Andreas, you wrote:
> If the word is 'a' and the last two words 'b' and 'c' (in that
order), and you have a bigram hit (output says '[2gram]' ), you'd have
to look up the bigram log probability for 'a c' and add to that the
backoff weight for 'b c'.
I interpret this as P(a | b c). If that is correct, then shouldn't I
actually look up the line in the ARPA LM that reads "c a" (i.e. the
reverse order of what you wrote)?
Could you comment further to clear up my remaining confusion?
1. The word "Albert" is not in the word list/vocabulary, nor are there
any N-grams containing "Albert". This confuses me: I cannot trace the
N-gram that led to the logprob reported by 'ngram'. It seems that I
cannot see directly in the 'ngram' output if or when any backing off
took place during the LM lookups for this sentence. I assume that the
word Albert was actually replaced with [unk], but the 'ngram' output
does not show this. There are also 0 OOVs reported, which strikes me as
odd. All in all, I believe that the following N-grams were looked up:
-1.358477 <s> dat -0.6622628
-1.334724 dat [unk] -0.0686222
-0.6776069 dat [unk] [unk]
for the 'ngram' output:
dat Albert Heijn het doet zou niet de aanleiding zijn
p( dat | <s> ) = 0.0438046 [ -1.35848 ]
p( Albert | dat ...) = 0.0100695 [ -1.99699 ]
p( Heijn | Albert ...) = 0.210084 [ -0.677607 ]
How did 'ngram' come to the -1.99699 logprob?
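
One possible reading, sketched below as a check on the arithmetic: if
"Albert" was mapped to [unk] and there is no trigram "<s> dat [unk]" in
the LM, a backoff model would add the backoff weight of "<s> dat" to the
bigram logprob of "dat [unk]". The three ARPA entries are copied from
above; the function name and the dictionary layout are only
illustrative, not SRILM's implementation.

```python
# ARPA entries from this message: n-gram -> (log10 prob, backoff weight or None)
arpa = {
    ("<s>", "dat"): (-1.358477, -0.6622628),
    ("dat", "[unk]"): (-1.334724, -0.0686222),
    ("dat", "[unk]", "[unk]"): (-0.6776069, None),
}

def backoff_logprob(word, context, lm):
    """Return log10 P(word | context) under a plain backoff scheme.

    `context` is a tuple of preceding words, oldest first.
    """
    ngram = tuple(context) + (word,)
    if ngram in lm:
        # Direct hit: use the stored log probability.
        return lm[ngram][0]
    if not context:
        raise KeyError(word)
    # Miss: take the backoff weight of the context (0.0 if none is stored)
    # and retry with the oldest context word dropped.
    bow = lm.get(tuple(context), (None, None))[1] or 0.0
    return bow + backoff_logprob(word, context[1:], lm)

# p( Albert | dat <s> ), assuming Albert -> [unk]
print(round(backoff_logprob("[unk]", ("<s>", "dat"), arpa), 5))
```

Under these assumptions the sum -0.6622628 + -1.334724 agrees with the
reported -1.99699 to the printed precision, which would make that line a
bigram hit with backoff rather than a trigram hit.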
Extra suggestions:
A. Just now I saw the 'ngram' option '-limit-vocab': 'The default is
that words used in the LM are automatically added to the vocabulary.' I
would say that not doing this, and restricting the known words to the
ones in the vocabulary, is a more sensible default, because the current
behavior defeats an important reason for specifying a vocabulary
(controlling the lookup in the LM). In any case, could you mention this
behavior under the description of '-vocab' as well?
B. I think it would be useful if each p( ... ) line in the 'ngram'
output listed something like the line number in the ARPA LM of the
N-gram that was retrieved, if need be only at the '-debug 4' level. This
could be a line number or some other definite key/index to the N-grams,
as long as it can be parsed automatically from the output.
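
To illustrate what "automatically parseable" would build on, here is a
rough sketch that parses the p( ... ) lines as they appear in this
message. The regular expression assumes exactly that layout, with an
optional '[2gram]'-style tag after the '=' as mentioned in the quoted
reply; the function name and the returned fields are hypothetical.

```python
import re

# Matches lines such as:  p( Albert | dat ...) = 0.0100695 [ -1.99699 ]
# and, optionally:        p( dat | <s> ) = [2gram] 0.0438046 [ -1.35848 ]
LINE_RE = re.compile(
    r"p\(\s*(?P<word>\S+)\s*\|\s*(?P<context>.*?)\s*\)\s*=\s*"
    r"(?:\[(?P<order>\d+)gram\]\s*)?"
    r"(?P<prob>\S+)\s*\[\s*(?P<logprob>\S+)\s*\]"
)

def parse_ppl_line(line):
    """Parse one p( ... ) line; return None if the line does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "word": m.group("word"),
        "context": m.group("context"),
        # N-gram order of the hit, if the output carries an [Ngram] tag.
        "order": int(m.group("order")) if m.group("order") else None,
        "prob": float(m.group("prob")),
        "logprob": float(m.group("logprob")),
    }

print(parse_ppl_line("p( Albert | dat ...) = 0.0100695 [ -1.99699 ]"))
```

A key or line number appended to each line, as suggested above, could
then be picked up by one more named group in the same pattern.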