[SRILM User List] Interpreting ngram output with -debug 2 , -cache and -cache-lambda options
zeeshan khan
zeeshankhans at gmail.com
Mon Apr 11 10:59:25 PDT 2011
Hi all,
I would like to understand the -debug 2 output of the ngram tool, both with
and without the -cache and -cache-lambda options.
Here are the two commands, first with and then without those options:
ngram -unk "UNKNOWN" -order 4 -lm <LM> -ppl <text-file> -debug 2 -cache 350 -cache-lambda 0.1
AND
ngram -unk "UNKNOWN" -order 4 -lm <LM> -ppl <text-file> -debug 2
I have the following questions:
1. What does [cache=xxxx] in each line mean, and how is it calculated?
2. Why do the two probabilities differ on lines where the cache probability
is zero, e.g. in the first 5 lines of both outputs?
3. Can the first entry in each line, i.e. [ngram], ever differ between the
two outputs? If yes, how?
Here are the first few lines of the output of each command:
------------------------------------------------------------------------------------------------------------------------
WITHOUT the -cache and -cache-lambda options:
------------------------------------------------------------------------------------------------------------------------
<s> this is a podcast of the highlights from today's woman's hour copyright
issues mean that we can't always include all the items from the programme
</s>
p( this | <s> ) = [2gram] 0.0155235 [ -1.80901 ]
p( is | this ...) = [3gram] 0.384267 [ -0.415367 ]
p( a | is ...) = [4gram] 0.171555 [ -0.765597 ]
p( podcast | a ...) = [4gram] 7.7717e-06 [ -5.10948 ]
p( of | podcast ...) = [4gram] 0.108064 [ -0.966317 ]
p( the | of ...) = [4gram] 0.366697 [ -0.435692 ]
p( highlights | the ...) = [3gram] 4.88751e-05 [ -4.31091 ]
p( from | highlights ...) = [4gram] 0.077328 [ -1.11166 ]
p( today's | from ...) = [4gram] 0.00790939 [ -2.10186 ]
p( woman's | today's ...) = [2gram] 9.67272e-06 [ -5.01445 ]
p( hour | woman's ...) = [3gram] 0.218998 [ -0.659561 ]
p( copyright | hour ...) = [1gram] 3.56089e-06 [ -5.44844 ]
p( issues | copyright ...) = [2gram] 0.0196718 [ -1.70615 ]
p( mean | issues ...) = [2gram] 0.00024042 [ -3.61903 ]
p( that | mean ...) = [3gram] 0.211744 [ -0.674189 ]
p( we | that ...) = [3gram] 0.0179052 [ -1.74702 ]
p( can't | we ...) = [4gram] 0.0186763 [ -1.72871 ]
p( always | can't ...) = [4gram] 0.00198593 [ -2.70204 ]
p( include | always ...) = [3gram] 0.000752505 [ -3.12349 ]
p( all | include ...) = [3gram] 0.00575442 [ -2.24 ]
p( the | all ...) = [4gram] 0.314584 [ -0.502263 ]
p( items | the ...) = [4gram] 0.00158827 [ -2.79908 ]
p( from | items ...) = [4gram] 0.0124186 [ -1.90593 ]
p( the | from ...) = [4gram] 0.415841 [ -0.381072 ]
p( programme | the ...) = [3gram] 0.000297532 [ -3.52647 ]
p( </s> | programme ...) = [4gram] 0.288492 [ -0.539866 ]
1 sentences, 25 words, 0 OOVs
0 zeroprobs, logprob= -55.3437 ppl= 134.463 ppl1= 163.586
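For reference, the ppl and ppl1 figures in the summary line above can be reproduced from logprob, using the formula as I understand it from the SRILM FAQ: ppl normalises by words plus end-of-sentence tokens, ppl1 by words only. A minimal Python sketch (the function name and arguments are my own, not part of SRILM):

```python
def perplexities(logprob, words, sentences, oovs=0, zeroprobs=0):
    """Recompute ppl and ppl1 from the counts in the ngram summary lines.

    Assumption: logprob is a base-10 log probability, and OOVs and
    zeroprobs are excluded from the number of scored words.
    """
    scored = words - oovs - zeroprobs
    ppl = 10 ** (-logprob / (scored + sentences))   # counts </s> tokens
    ppl1 = 10 ** (-logprob / scored)                # words only
    return ppl, ppl1


# Values taken from the summary of the run without -cache.
ppl, ppl1 = perplexities(-55.3437, words=25, sentences=1)
print(f"ppl={ppl:.3f} ppl1={ppl1:.3f}")
```

The result should agree with the reported ppl= 134.463 and ppl1= 163.586 up to rounding of the printed logprob.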
-----------------------------------------------------------------------------------------------------------------------
WITH the -cache and -cache-lambda options:
-----------------------------------------------------------------------------------------------------------------------
<s> this is a podcast of the highlights from today's woman's hour copyright
issues mean that we can't always include all the items from the programme
</s>
p( this | <s> ) = [2gram][cache=0] 0.0139712 [ -1.85477 ]
p( is | this ...) = [3gram][cache=0] 0.34584 [ -0.461124 ]
p( a | is ...) = [4gram][cache=0] 0.154399 [ -0.811355 ]
p( podcast | a ...) = [4gram][cache=0] 6.99453e-06 [ -5.15524 ]
p( of | podcast ...) = [4gram][cache=0] 0.0972579 [ -1.01207 ]
p( the | of ...) = [4gram][cache=0] 0.330028 [ -0.48145 ]
p( highlights | the ...) = [3gram][cache=0] 4.39876e-05 [ -4.35667 ]
p( from | highlights ...) = [4gram][cache=0] 0.0695952 [ -1.15742 ]
p( today's | from ...) = [4gram][cache=0] 0.00711845 [ -2.14761 ]
p( woman's | today's ...) = [2gram][cache=0] 8.70545e-06 [ -5.06021 ]
p( hour | woman's ...) = [3gram][cache=0] 0.197098 [ -0.705318 ]
p( copyright | hour ...) = [1gram][cache=0] 3.2048e-06 [ -5.4942 ]
p( issues | copyright ...) = [2gram][cache=0] 0.0177047 [ -1.75191 ]
p( mean | issues ...) = [2gram][cache=0] 0.000216378 [ -3.66479 ]
p( that | mean ...) = [3gram][cache=0] 0.190569 [ -0.719947 ]
p( we | that ...) = [3gram][cache=0] 0.0161147 [ -1.79278 ]
p( can't | we ...) = [4gram][cache=0] 0.0168087 [ -1.77447 ]
p( always | can't ...) = [4gram][cache=0] 0.00178733 [ -2.74779 ]
p( include | always ...) = [3gram][cache=0] 0.000677254 [ -3.16925 ]
p( all | include ...) = [3gram][cache=0] 0.00517898 [ -2.28576 ]
p( the | all ...) = [4gram][cache=0.05] 0.288126 [ -0.540418 ]
p( items | the ...) = [4gram][cache=0] 0.00142944 [ -2.84483 ]
p( from | items ...) = [4gram][cache=0.0454545] 0.0157222 [ -1.80349 ]
p( the | from ...) = [4gram][cache=0.0869565] 0.382953 [ -0.416855 ]
p( programme | the ...) = [3gram][cache=0] 0.000267779 [ -3.57222 ]
p( </s> | programme ...) = [4gram][cache=0] 0.259643 [ -0.585623 ]
1 sentences, 25 words, 0 OOVs
0 zeroprobs, logprob= -56.3676 ppl= 147.226 ppl1= 179.764
-----------------------------------------------------------------------------------------------------------------------
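For what it is worth, the numbers in the two outputs above look consistent with a simple reading of the options. It is a guess from the figures, not something confirmed against the SRILM source: [cache=x] would be the relative frequency of the predicted word among the preceding words (up to the -cache length, not counting <s>), and the printed probability would be (1 - lambda) * p_ngram + lambda * p_cache with lambda = 0.1. Under that reading every line with cache=0 is simply scaled by 0.9, e.g. 0.9 * 0.0155235 = 0.0139712 for "this". A minimal Python sketch that checks a few lines (all names are illustrative):

```python
from collections import Counter

CACHE_LAMBDA = 0.1  # value given to -cache-lambda in the command above
CACHE_SIZE = 350    # value given to -cache in the command above

words = ("this is a podcast of the highlights from today's woman's hour "
         "copyright issues mean that we can't always include all the items "
         "from the programme").split()

# N-gram probabilities copied from the output WITHOUT -cache,
# keyed by 0-based word position in the sentence.
ngram_prob = {0: 0.0155235, 20: 0.314584, 22: 0.0124186, 23: 0.415841}


def cache_prob(history, word, size=CACHE_SIZE):
    """Relative frequency of `word` among the last `size` history words.

    Assumption: the cache is a plain unigram count over preceding words.
    """
    window = history[-size:]
    if not window:
        return 0.0
    return Counter(window)[word] / len(window)


def mixed_prob(position):
    """Return (cache probability, interpolated probability) for one word."""
    p_cache = cache_prob(words[:position], words[position])
    p = (1 - CACHE_LAMBDA) * ngram_prob[position] + CACHE_LAMBDA * p_cache
    return p_cache, p


for pos in sorted(ngram_prob):
    p_cache, p = mixed_prob(pos)
    print(f"{words[pos]:>5}  cache={p_cache:.6g}  p={p:.6g}")
```

For the three lines with a non-zero cache value this gives 1/20, 1/22 and 2/23, matching [cache=0.05], [cache=0.0454545] and [cache=0.0869565], and the interpolated values agree with the printed probabilities to the digits shown.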
best regards,
Zeeshan Khan