[SRILM User List] ppl output from ngram interpret

Wed Apr 16 10:44:14 PDT 2014

Hi Andreas and 贺天行,

Thanks. I understand now.

Jian

On Wed, Apr 16, 2014 at 5:50 PM, Andreas Stolcke
<stolcke at icsi.berkeley.edu> wrote:

>  On 4/16/2014 3:20 AM, jian zhang wrote:
>
> Hi Andreas,
>
>  I am confused about the ppl output from ngram.
> The following are the outputs from two sentences,
>
>  resumption of the session
>  p( resumption | <s> ) = [1gram] 6.41856e-07 [ -6.19256 ]
>  p( of | resumption ...) = [2gram] 0.547254 [ -0.261811 ]
>  *p( the | of ...) = [2gram] 0.0826684 [ -1.08266 ]*
>  p( session | the ...) = [1gram] 1.21666e-06 [ -5.91483 ]
>  p( </s> | session ...) = [1gram] 0.00150439 [ -2.82264 ]
> 1 sentences, 4 words, 0 OOVs
> 0 zeroprobs, logprob= -16.2745 ppl= 1798.46 ppl1= 11711.9
> 4 words, rank1= 0.25 rank5= 0.5 rank10= 0.5
> 5 words+sents, rank1wSent= 0.2 rank5wSent= 0.4 rank10wSent= 0.4 qloss= 0.899274 absloss= 0.873714
>
>  you have requested a debate on this subject in the course of the next
> few days , during this part-session .
>  p( you | <s> ) = [2gram] 0.000716442 [ -3.14482 ]
>  p( have | you ...) = [2gram] 0.0179397 [ -1.74618 ]
>  p( requested | have ...) = [1gram] 6.43992e-06 [ -5.19112 ]
>  p( a | requested ...) = [1gram] 0.00378035 [ -2.42247 ]
>  p( debate | a ...) = [2gram] 0.000358849 [ -3.44509 ]
>  p( on | debate ...) = [2gram] 0.0598839 [ -1.22269 ]
>  p( this | on ...) = [2gram] 0.00443142 [ -2.35346 ]
>  p( subject | this ...) = [2gram] 9.54276e-05 [ -4.02033 ]
>  p( in | subject ...) = [2gram] 0.0436281 [ -1.36023 ]
>  p( the | in ...) = [2gram] 0.147714 [ -0.830578 ]
>  p( course | the ...) = [3gram] 0.00139691 [ -2.85483 ]
>  p( of | course ...) = [3gram] 0.579381 [ -0.237035 ]
>  *p( the | of ...) = [2gram] 0.0762541 [ -1.11774 ]*
>  p( next | the ...) = [3gram] 0.00123622 [ -2.9079 ]
>  p( few | next ...) = [3gram] 0.0245328 [ -1.61025 ]
>  p( days | few ...) = [2gram] 0.00340647 [ -2.46769 ]
>  p( , | days ...) = [2gram] 0.15756 [ -0.802555 ]
>  p( during | , ...) = [2gram] 0.000749831 [ -3.12504 ]
>  p( this | during ...) = [3gram] 0.0352358 [ -1.45302 ]
>  p( <unk> | this ...) = [1gram] 9.0905e-07 [ -6.04141 ]
>  p( . | <unk> ...) = [1gram] 0.0254746 [ -1.59389 ]
>  p( </s> | . ...) = [2gram] 0.809733 [ -0.091658 ]
> 1 sentences, 21 words, 0 OOVs
> 0 zeroprobs, logprob= -50.04 ppl= 188.168 ppl1= 241.466
> 21 words, rank1= 0.142857 rank5= 0.428571 rank10= 0.47619
> 22 words+sents, rank1wSent= 0.181818 rank5wSent= 0.454545 rank10wSent= 0.5 qloss= 0.930912 absloss= 0.909386
>
>  My two questions:
> 1. The 2-gram p( the | of ...) is computed in both sentences. Why do the
> two have different probabilities (the first sentence gives 0.0826684, the
> second gives 0.0762541)?
>
> Because the backoff weights (bow) depend on the trigram context.
> So the first probability equals
>         bow("resumption of") * p("the"| "of")
> whereas the second probability is
>         bow("course of") * p("the" | "of")
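A quick check of this explanation, as a rough sketch: if both lines share the same bigram term p("the" | "of"), that term cancels when the two printed log10 values are subtracted, leaving only the ratio of the two backoff weights. The variable names below are made up for illustration; the numbers are copied from the output above. The individual backoff weights cannot be recovered from this output alone, only their ratio.

```python
# log10 probabilities printed by ngram for p( the | of ...) in the two sentences
logp_after_resumption_of = -1.08266   # context "resumption of"
logp_after_course_of = -1.11774       # context "course of"

# Linear probabilities printed on the same two lines
p_after_resumption_of = 0.0826684
p_after_course_of = 0.0762541

# log10 P = log10 bow(context) + log10 p("the" | "of"),
# so the shared bigram term drops out of the difference.
log_bow_ratio = logp_after_resumption_of - logp_after_course_of
bow_ratio_from_logs = 10 ** log_bow_ratio

# The same ratio computed directly from the linear probabilities
bow_ratio_from_probs = p_after_resumption_of / p_after_course_of

print(f"bow ratio from logs:  {bow_ratio_from_logs:.4f}")
print(f"bow ratio from probs: {bow_ratio_from_probs:.4f}")
```

Both values should agree up to rounding in the printed output, which is consistent with the two lines differing only in their backoff weight.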
>
>   2. Is there an ngram option that prints the actual context tokens
> instead of the ellipsis?
>
>
>   No, unfortunately.  The idea behind the output format was to keep the
> number of fields constant so as to facilitate parsing with awk/perl/etc.
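To illustrate the constant-field-count point, here is a minimal Python sketch (Python instead of awk/perl is my choice, and the function name parse_prob_line is hypothetical). It assumes each per-word line splits into exactly 11 whitespace-separated fields, which holds for the lines shown above once the mail quoting and the asterisks are stripped. It then sums the log10 values for the first sentence and recomputes the two perplexities as 10^(-logprob / (words + sentences)) and 10^(-logprob / words), which is how I understand the summary line when there are no OOVs or zeroprobs.

```python
def parse_prob_line(line):
    """Split one per-word line into (word, context, tag, prob, logprob).

    Returns None for lines that do not have the expected 11 fields.
    """
    fields = line.split()
    if len(fields) != 11 or fields[0] != "p(":
        return None
    word = fields[1]             # predicted word
    context = fields[3]          # only the most recent context word is printed
    tag = fields[6].strip("[]")  # e.g. "2gram"
    prob = float(fields[7])
    logprob = float(fields[9])   # log10 probability
    return word, context, tag, prob, logprob


# Per-word lines of the first sentence, copied from the output above
lines = [
    "p( resumption | <s> ) = [1gram] 6.41856e-07 [ -6.19256 ]",
    "p( of | resumption ...) = [2gram] 0.547254 [ -0.261811 ]",
    "p( the | of ...) = [2gram] 0.0826684 [ -1.08266 ]",
    "p( session | the ...) = [1gram] 1.21666e-06 [ -5.91483 ]",
    "p( </s> | session ...) = [1gram] 0.00150439 [ -2.82264 ]",
]

parsed = [parse_prob_line(line) for line in lines]
total_logprob = sum(entry[4] for entry in parsed)

num_sentences = 1
num_words = len(parsed) - num_sentences  # </s> is not counted as a word

# Perplexity with and without the end-of-sentence token in the denominator
ppl = 10 ** (-total_logprob / (num_words + num_sentences))
ppl1 = 10 ** (-total_logprob / num_words)

print(f"logprob= {total_logprob:.4f} ppl= {ppl:.2f} ppl1= {ppl1:.1f}")
```

The recomputed values should match the "logprob= -16.2745 ppl= 1798.46 ppl1= 11711.9" summary line up to rounding.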
>
> Andreas
>
>

-- 
Jian Zhang
Centre for Next Generation Localisation (CNGL)<http://www.cngl.ie/index.html>
Dublin City University <http://www.dcu.ie/>