[SRILM User List] ppl output from ngram interpret

Wed Apr 16 09:50:00 PDT 2014

On 4/16/2014 3:20 AM, jian zhang wrote:
> Hi Andreas,
>
> I am confused about the ppl output from ngram.
> The following are the outputs from two sentences,
>
> resumption of the session
> p( resumption | <s> ) = [1gram] 6.41856e-07 [ -6.19256 ]
> p( of | resumption ...) = [2gram] 0.547254 [ -0.261811 ]
> *p( the | of ...) = [2gram] 0.0826684 [ -1.08266 ]*
> p( session | the ...) = [1gram] 1.21666e-06 [ -5.91483 ]
> p( </s> | session ...) = [1gram] 0.00150439 [ -2.82264 ]
> 1 sentences, 4 words, 0 OOVs
> 0 zeroprobs, logprob= -16.2745 ppl= 1798.46 ppl1= 11711.9
> 4 words, rank1= 0.25 rank5= 0.5 rank10= 0.5
> 5 words+sents, rank1wSent= 0.2 rank5wSent= 0.4 rank10wSent= 0.4 qloss= 0.899274 absloss= 0.873714
>
> you have requested a debate on this subject in the course of the next few days , during this part-session .
> p( you | <s> ) = [2gram] 0.000716442 [ -3.14482 ]
> p( have | you ...) = [2gram] 0.0179397 [ -1.74618 ]
> p( requested | have ...) = [1gram] 6.43992e-06 [ -5.19112 ]
> p( a | requested ...) = [1gram] 0.00378035 [ -2.42247 ]
> p( debate | a ...) = [2gram] 0.000358849 [ -3.44509 ]
> p( on | debate ...) = [2gram] 0.0598839 [ -1.22269 ]
> p( this | on ...) = [2gram] 0.00443142 [ -2.35346 ]
> p( subject | this ...) = [2gram] 9.54276e-05 [ -4.02033 ]
> p( in | subject ...) = [2gram] 0.0436281 [ -1.36023 ]
> p( the | in ...) = [2gram] 0.147714 [ -0.830578 ]
> p( course | the ...) = [3gram] 0.00139691 [ -2.85483 ]
> p( of | course ...) = [3gram] 0.579381 [ -0.237035 ]
> *p( the | of ...) = [2gram] 0.0762541 [ -1.11774 ]*
> p( next | the ...) = [3gram] 0.00123622 [ -2.9079 ]
> p( few | next ...) = [3gram] 0.0245328 [ -1.61025 ]
> p( days | few ...) = [2gram] 0.00340647 [ -2.46769 ]
> p( , | days ...) = [2gram] 0.15756 [ -0.802555 ]
> p( during | , ...) = [2gram] 0.000749831 [ -3.12504 ]
> p( this | during ...) = [3gram] 0.0352358 [ -1.45302 ]
> p( <unk> | this ...) = [1gram] 9.0905e-07 [ -6.04141 ]
> p( . | <unk> ...) = [1gram] 0.0254746 [ -1.59389 ]
> p( </s> | . ...) = [2gram] 0.809733 [ -0.091658 ]
> 1 sentences, 21 words, 0 OOVs
> 0 zeroprobs, logprob= -50.04 ppl= 188.168 ppl1= 241.466
> 21 words, rank1= 0.142857 rank5= 0.428571 rank10= 0.47619
> 22 words+sents, rank1wSent= 0.181818 rank5wSent= 0.454545 rank10wSent= 0.5 qloss= 0.930912 absloss= 0.909386
>
> My two questions:
> 1. Both sentences contain the 2-gram p( the | of ...). Why do they
> have different probabilities (the first sentence gives 0.0826684,
> the second sentence gives 0.0762541)?
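Because the backoff weights depend on the trigram context, i.e. on the
two words preceding "the".  In both sentences the trigram is not in the
model, so the bigram probability p("the" | "of") is scaled by the backoff
weight of the two-word history.  So the first probability equals
         bow("resumption of") * p("the" | "of")
whereas the second probability is
         bow("course of") * p("the" | "of")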
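
To illustrate the point above, here is a minimal Python sketch.  It assumes
both probabilities really are bow(history) * p("the" | "of") with the same
bigram probability, as described above; under that assumption the bigram
factor cancels and the ratio of the two printed probabilities equals the
ratio of the two backoff weights.  The function name is made up for this
example, and the only numbers used are the ones quoted in this thread.

```python
import math

def bow_ratio(p_first, p_second):
    """Ratio bow(history1) / bow(history2), assuming both probabilities
    are bow(history) * p(word | shorter history) with the SAME
    lower-order probability, so that factor cancels."""
    return p_first / p_second

# Values printed by ngram for p( the | of ...) in the two sentences.
p_resumption_of = 0.0826684   # history "resumption of"
p_course_of = 0.0762541       # history "course of"

ratio = bow_ratio(p_resumption_of, p_course_of)
print("bow(resumption of) / bow(course of) =", ratio)

# The bracketed numbers in the output are log10 probabilities, so the
# same ratio can be read off as a difference of the two log values.
log_diff = math.log10(p_resumption_of) - math.log10(p_course_of)
print("difference of log10 probabilities   =", log_diff)
```

The individual backoff weights cannot be recovered from this output alone;
they would have to be looked up in the language model file itself.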
> 2. Is there a parameter setting for ngram that prints out the actual
> context tokens instead of the ellipsis?
>
>
No, unfortunately.  The idea behind the output format was to keep the 
number of fields constant so as to facilitate parsing with awk/perl/etc.
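
As a possible workaround, the constant field count makes it fairly easy to
rebuild the context outside of ngram.  The following Python sketch is not
part of SRILM; the function names are hypothetical.  It assumes per-word
lines in the format shown earlier in this thread (11 whitespace-separated
fields) and assumes the bracketed tag such as [2gram] gives the order of
the n-gram that was actually used, so that an order-n hit uses the n-1
preceding words as context.

```python
def parse_word_line(line):
    """Parse one per-word line; return None for any other kind of line."""
    fields = line.split()
    if len(fields) != 11 or fields[0] != "p(":
        return None
    tag = fields[6].strip("[]")                 # e.g. "2gram"
    digits = tag[:-4] if tag.endswith("gram") else ""
    order = int(digits) if digits.isdigit() else None
    return {
        "word": fields[1],
        "prev": fields[3],                      # the one context word ngram prints
        "order": order,                         # None if the tag is not "<n>gram"
        "prob": float(fields[7]),
        "log10prob": float(fields[9]),
    }

def matched_ngrams(lines):
    """Return (context_tuple, word, prob) for each per-word line of ONE sentence."""
    history = ["<s>"]
    result = []
    for line in lines:
        rec = parse_word_line(line)
        if rec is None:
            continue
        n = rec["order"] or 1
        start = max(0, len(history) - (n - 1))
        context = tuple(history[start:]) if n > 1 else ()
        result.append((context, rec["word"], rec["prob"]))
        history.append(rec["word"])
    return result

sample = [
    "p( resumption | <s> ) = [1gram] 6.41856e-07 [ -6.19256 ]",
    "p( of | resumption ...) = [2gram] 0.547254 [ -0.261811 ]",
    "p( the | of ...) = [2gram] 0.0826684 [ -1.08266 ]",
]
for context, word, prob in matched_ngrams(sample):
    print(" ".join(context) or "(no context)", "->", word, prob)
```

Note that this only recovers the context of the n-gram that was matched.
The longer history whose backoff weight was applied is simply the preceding
words of the sentence, which the history list already holds.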

Andreas
