interpreting -order and -debug results
Alexy Khrabrov
deliverable at gmail.com
Sun Nov 30 19:41:17 PST 2008
Greetings -- I've trained a Kneser-Ney model on a Russian corpus with
-order 5 -kndiscount, and started it as a server with -order 5. Then,
to check that 5-grams are indeed being used, I feed it a sentence whose
first word is (a) an existing word present in the corpus, or (b) a
made-up word not present in the Russian language. I run both 5-word
sentences in two ways: (1) -order 5 -debug 2 and (2) -order 0 -debug 3,
both with -ppl. The results, which puzzle me, are below, followed by a
description of the puzzlement.
~ echo c этим заявлением он выступил | ngram -use-server <badbox> -order 5 -debug 2 -ppl -
server <badbox>: probserver ready
c этим заявлением он выступил
p( c | <s> ) = 3.67342e-06 [ -5.43493 ]
p( этим | c ...) = 0.00102315 [ -2.99006 ]
p( заявлением | этим ...) = 0.00151464 [ -2.81969 ]
p( он | заявлением ...) = 0.0218172 [ -1.6612 ]
p( выступил | он ...) = 0.000925487 [ -3.03363 ]
p( </s> | выступил ...) = 0.00693155 [ -2.15917 ]
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -18.0987 ppl= 1038.6 ppl1= 4166.16
file -: 1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -18.0987 ppl= 1038.6 ppl1= 4166.16
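The summary line can be checked by hand. A minimal sketch, assuming the perplexity definitions given in the SRILM FAQ (ppl is normalized by words - OOVs - zeroprobs + sentences, ppl1 by words - OOVs - zeroprobs); the function name `perplexities` is made up for this illustration:

```python
def perplexities(logprob, words, sentences, oovs=0, zeroprobs=0):
    """Return (ppl, ppl1) from the totals on an `ngram -ppl` summary line.

    Assumed definitions (SRILM FAQ), with logprob in base 10:
      ppl  = 10 ** (-logprob / (words - oovs - zeroprobs + sentences))
      ppl1 = 10 ** (-logprob / (words - oovs - zeroprobs))
    """
    scored = words - oovs - zeroprobs  # tokens that contributed to logprob
    ppl = 10 ** (-logprob / (scored + sentences))  # counts </s> tokens too
    ppl1 = 10 ** (-logprob / scored)               # excludes </s> tokens
    return ppl, ppl1


# Totals copied from the summary line above.
ppl, ppl1 = perplexities(-18.0987, words=5, sentences=1)
print(f"ppl={ppl:.1f} ppl1={ppl1:.1f}")
```

Because the printed logprob is rounded, the recomputed values agree with the reported ppl= 1038.6 and ppl1= 4166.16 only up to small rounding differences.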
~ echo жуемотничая этим заявлением он выступил | ngram -use-server <badbox> -order 5 -debug 2 -ppl -
server <badbox>: probserver ready
жуемотничая этим заявлением он выступил
p( жуемотничая | <s> ) = 0 [ -inf ]
p( этим | жуемотничая ...) = 0.00014788 [ -3.83009 ]
p( заявлением | этим ...) = 0.00151464 [ -2.81969 ]
p( он | заявлением ...) = 0.0218172 [ -1.6612 ]
p( выступил | он ...) = 0.000925487 [ -3.03363 ]
p( </s> | выступил ...) = 0.00693155 [ -2.15917 ]
1 sentences, 5 words, 0 OOVs
1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54
file -: 1 sentences, 5 words, 0 OOVs
1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54
== Notice that from the 3rd p( word | context ...) line onward, the
conditional probabilities are identical in both runs, although we're
using a 5-gram model and in the second run the first word does not
exist! That run also reports 0 OOVs (?).
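One way to see where the total logprob of the made-up-word run comes from is to sum the bracketed log10 values from its -debug 2 lines. A rough sketch; the regex and the helper name `sum_logprobs` are assumptions for illustration and only handle lines where the probability and its bracketed log sit on one line. It counts `-inf` entries as zeroprobs and leaves them out of the sum:

```python
import math
import re

# Captures the bracketed log10 value in a "p( w | ctx ) = p [ logp ]" line.
LOGPROB_RE = re.compile(
    r"\[\s*(-?inf|-?\d+(?:\.\d+)?(?:e[-+]?\d+)?)\s*\]", re.IGNORECASE
)


def sum_logprobs(debug_lines):
    """Return (sum of finite log10 probs, number of -inf entries)."""
    total, zeroprobs = 0.0, 0
    for line in debug_lines:
        match = LOGPROB_RE.search(line)
        if not match:
            continue  # not a per-word probability line
        value = float(match.group(1))  # float() accepts "-inf"
        if math.isinf(value):
            zeroprobs += 1
        else:
            total += value
    return total, zeroprobs


# Per-word lines copied from the second -order 5 run above.
fake_run = [
    "p( жуемотничая | <s> ) = 0 [ -inf ]",
    "p( этим | жуемотничая ...) = 0.00014788 [ -3.83009 ]",
    "p( заявлением | этим ...) = 0.00151464 [ -2.81969 ]",
    "p( он | заявлением ...) = 0.0218172 [ -1.6612 ]",
    "p( выступил | он ...) = 0.000925487 [ -3.03363 ]",
    "p( </s> | выступил ...) = 0.00693155 [ -2.15917 ]",
]

total, zeroprobs = sum_logprobs(fake_run)
print(f"logprob ~ {total:.4f}, zeroprobs = {zeroprobs}")
```

The five finite values add up to about the reported logprob= -13.5038, so the made-up word shows up as a zeroprob that is skipped rather than as a penalty in the total.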
== Now, let's explore what "unlimited ngrams" means with -order 0, and
set -debug 3 too:
~ echo с этим заявлением он выступил | ngram -use-server <badbox> -order 0 -debug 3 -ppl -
server <badbox>: probserver ready
с этим заявлением он выступил
warning: word probs for this context sum to 0.00119158 != 1 : <s>
p( с | <s> ) = 0.000113967 [ -3.94322 ] / 0.00119158
warning: word probs for this context sum to 0.0248594 != 1 : с <s>
p( этим | с ...) = 0.00614229 [ -2.21167 ] / 0.0248594
warning: word probs for this context sum to 0.0135057 != 1 : этим с <s>
p( заявлением | этим ...) = 0.0026996 [ -2.5687 ] / 0.0135057
warning: word probs for this context sum to 0.136629 != 1 : заявлением этим с <s>
p( он | заявлением ...) = 0.0191721 [ -1.71733 ] / 0.136629
warning: word probs for this context sum to 0.00931138 != 1 : он заявлением этим с <s>
p( выступил | он ...) = 0.000925487 [ -3.03363 ] / 0.00931138
warning: word probs for this context sum to 0.243228 != 1 : выступил он заявлением этим с <s>
p( </s> | выступил ...) = 0.00693155 [ -2.15917 ] / 0.243228
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -15.6337 ppl= 403.293 ppl1= 1338.89
file -: 1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -15.6337 ppl= 403.293 ppl1= 1338.89
-----
~ echo жуемотничая этим заявлением он выступил | ngram -use-server <badbox> -order 0 -debug 3 -ppl -
server <badbox>: probserver ready
жуемотничая этим заявлением он выступил
warning: word probs for this context sum to 0.00107762 != 1 : <s>
p( жуемотничая | <s> ) = 0 [ -inf ] / 0.00107762
warning: word probs for this context sum to 0.0136768 != 1 : жуемотничая <s>
p( этим | жуемотничая ...) = 0.00014788 [ -3.83009 ] / 0.0136768
warning: word probs for this context sum to 0.0105593 != 1 : этим жуемотничая <s>
p( заявлением | этим ...) = 0.00151464 [ -2.81969 ] / 0.0105593
warning: word probs for this context sum to 0.0891667 != 1 : заявлением этим жуемотничая <s>
p( он | заявлением ...) = 0.0218172 [ -1.6612 ] / 0.0891667
warning: word probs for this context sum to 0.00501918 != 1 : он заявлением этим жуемотничая <s>
p( выступил | он ...) = 0.000925487 [ -3.03363 ] / 0.00501918
warning: word probs for this context sum to 0.00712921 != 1 : выступил он заявлением этим жуемотничая <s>
p( </s> | выступил ...) = 0.00693155 [ -2.15917 ] / 0.00712921
1 sentences, 5 words, 0 OOVs
1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54
file -: 1 sentences, 5 words, 0 OOVs
1 zeroprobs, logprob= -13.5038 ppl= 502.061 ppl1= 2376.54
== Now we get more differences: the "real" first example differs from
the "fake" second one in the first 4 lines, and the p(|)'s are the same
only for the last two lines, 5 and 6. However, the 4th line of the
"real" case has a *lower* probability than the 4th line of the *fake*
case:
p( он | заявлением ...) = 0.0191721 (real) < p( он | заявлением ...) = 0.0218172 (fake)
Again, we see 0 OOVs reported in both cases, despite "жуемотничая"
being a fake word with 0 [ -inf ] prob.
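The line-by-line comparison of the two -order 0 runs can be reproduced mechanically. A small sketch with the probabilities copied from the two -debug 3 transcripts; the variable names are made up for this illustration:

```python
# Per-line conditional probabilities copied from the two -order 0 runs.
real = [0.000113967, 0.00614229, 0.0026996, 0.0191721, 0.000925487, 0.00693155]
fake = [0.0, 0.00014788, 0.00151464, 0.0218172, 0.000925487, 0.00693155]

# 1-based line numbers, matching how the lines are counted in the text.
pairs = list(enumerate(zip(real, fake), start=1))
differing = [i for i, (r, f) in pairs if r != f]
fake_higher = [i for i, (r, f) in pairs if f > r]

print("lines that differ:", differing)
print("lines where the fake run scores higher:", fake_higher)
```

This only restates the observation in code: the runs diverge on lines 1 through 4, and line 4 is the one place where the sentence with the made-up first word gets the higher probability.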
Although the final perplexity is higher for the fake case in the
-order 0 runs (502.061 vs. 403.293), it is lower in the -order 5 runs
(502.061 vs. 1038.6), so I can't be certain from these results that the
-order 5 option is being honored. I'm also not sure what -order 0 does
here, or why a conditional probability can be higher when the sentence
starts with a fake word. Finally, what exactly are the -debug 3 "word
probs for this context", why would they trigger a warning for a rather
large real corpus, and how should I interpret it?
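As background for the last question: the -debug 3 level of ngram is, as far as the ngram(1) man page goes, documented as verifying that word probabilities over the entire vocabulary sum to 1 for each context. Assuming that reading is right, the check amounts to something like the toy sketch below. The distributions, the tolerance, and the function name `check_normalization` are all invented for illustration and are not SRILM code:

```python
def check_normalization(cond_probs, context, tol=1e-6):
    """Sum p(w | context) over the enumerated words and test that it is ~1.

    cond_probs maps each vocabulary word to its conditional probability.
    Returns (total, ok).
    """
    total = sum(cond_probs.values())
    ok = abs(total - 1.0) <= tol
    if not ok:
        # Message modelled on the warning text in the transcripts above.
        print(f"warning: word probs for this context sum to {total:g} != 1 : {context}")
    return total, ok


# Hypothetical toy distributions over a 3-word vocabulary.
complete = {"a": 0.5, "b": 0.3, "c": 0.2}
partial = {"a": 0.5, "b": 0.3}  # one vocabulary word left out of the sum

print(check_normalization(complete, "<s>"))
print(check_normalization(partial, "<s>"))
```

In the toy case the sum falls short because a word is missing from the enumeration; whether something comparable happens in the client/server setup used here is not established by these transcripts.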
For reference, here are the model building commands I used:
time make-batch-counts list/list-stok 100000 cat counts/5g -order 5 > /dev/null 2>&1
time merge-batch-counts counts/5g
time make-big-lm -name lm-ko-kn5 -lm lm-ko-kn5 -max-per-file 100000000 -kndiscount -order 5 -read counts/5g/*.ngrams.gz
-- and here's how I launch the resulting LM server:
ngram -server-port <badport> -lm /data/rupress/lm-ko-kn5 -order 5
Cheers,
Alexy