<html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40">


<head>


<meta http-equiv="Content-Type" content="text/html; charset=utf-8">


<meta name="Generator" content="Microsoft Word 15 (filtered medium)">


<style><!--


/* Font Definitions */


@font-face


        {font-family:"Cambria Math";


        panose-1:2 4 5 3 5 4 6 3 2 4;}


@font-face


        {font-family:Calibri;


        panose-1:2 15 5 2 2 2 4 3 2 4;}


/* Style Definitions */


p.MsoNormal, li.MsoNormal, div.MsoNormal


        {margin:0cm;


        margin-bottom:.0001pt;


        font-size:11.0pt;


        font-family:"Calibri",sans-serif;


        mso-fareast-language:EN-US;}


a:link, span.MsoHyperlink


        {mso-style-priority:99;


        color:#0563C1;


        text-decoration:underline;}


a:visited, span.MsoHyperlinkFollowed


        {mso-style-priority:99;


        color:#954F72;


        text-decoration:underline;}


span.EmailStyle17


        {mso-style-type:personal-compose;


        font-family:"Calibri",sans-serif;


        color:windowtext;}


.MsoChpDefault


        {mso-style-type:export-only;


        font-family:"Calibri",sans-serif;


        mso-fareast-language:EN-US;}


@page WordSection1


        {size:612.0pt 792.0pt;


        margin:70.85pt 70.85pt 70.85pt 70.85pt;}


div.WordSection1


        {page:WordSection1;}


--></style>


</head>


<body lang="NL" link="#0563C1" vlink="#954F72">


<div class="WordSection1">


<p class="MsoNormal"><span lang="EN-US">Hi,<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">I derived a fifth-order LM and a vocabulary from a file input.txt using ngram-count. As a second step, I would like to compute a Word Probability Distribution (WPD) for all sentences in another file called test.txt, i.e. how probable each word from the vocabulary is after a given n-gram context. For instance, imagine that “bob paints a lot of pictures depicting mountains” is a sentence in test.txt. I can then prepare a file test_sentence1.txt:<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">bob paints a lot word_1<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">bob paints a lot word_2<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">…<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">bob paints a lot word_n<o:p></o:p></span></p>
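<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>

<p class="MsoNormal"><span lang="EN-US">A minimal sketch of how such a file could be generated, assuming the vocabulary is already available as a Python list. The names build_query_lines and toy_vocab are hypothetical and used only for illustration:<o:p></o:p></span></p>

<pre>
```python
from typing import Iterable, List


def build_query_lines(context: List[str], vocab: Iterable[str]) -> List[str]:
    """Return one line per vocabulary word: the context followed by that word."""
    prefix = " ".join(context)
    return [f"{prefix} {word}" for word in vocab]


# Hypothetical toy vocabulary; a real one would be read from the vocabulary file.
toy_vocab = ["statistics", "of", "pictures"]
lines = build_query_lines(["bob", "paints", "a", "lot"], toy_vocab)
print("\n".join(lines))
```
</pre>

<p class="MsoNormal"><span lang="EN-US">Each printed line corresponds to one line of test_sentence1.txt, so the file has as many lines as the vocabulary has words.<o:p></o:p></span></p>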


<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">I then compute the probability of every word_x with:<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">ngram -ppl test_sentence1.txt -order 5 -debug 2 &gt; ppl_sentence1.txt<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US" style="font-family:'Courier New'"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">The blocks of the result look somewhat like this:<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">bob paints a lot statistics<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">     p( bob | &lt;s&gt; ) =  0.009426857 [ -2.025633 ]<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">     p( paints | bob ...)   </span>=  0.04610244 [ -1.336276 ]<o:p></o:p></p>


<p class="MsoNormal">     p( a | paints ...)    =  0.04379878 [ -1.358538 ]<o:p></o:p></p>


<p class="MsoNormal">     p( lot | a ...) =  0.02713076 [ -1.566538 ]<o:p></o:p></p>


<p class="MsoNormal"><span lang="EN-US">     p( statistics | lot ...)    =  1.85185e-09 [ -8.732394 ]    &lt;---- target: P(statistics|bob paints a lot)<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">     p( &lt;/s&gt; | statistics ...)    =  0.04183223 [ -1.378489 ]<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">1 sentences, 5 words, 0 OOVs<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">0 zeroprobs, logprob= -23.32394 ppl= 2147.79 ppl1= 7714.783<o:p></o:p></span></p>
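<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>

<p class="MsoNormal"><span lang="EN-US">Picking the target line out of each block can be automated. The following is only a sketch, assuming the output looks like the block above; the regular expression, the optional "[5gram]"-style tag it tolerates, and the function name target_log10_prob are my own assumptions, not part of SRILM:<o:p></o:p></span></p>

<pre>
```python
import re
from typing import Optional

# Matches lines such as "p( statistics | lot ...) = 1.85185e-09 [ -8.732394 ]".
# The optional "[5gram]"-style tag before the probability is an assumption
# about output variants and is simply skipped if present.
LINE_RE = re.compile(
    r"p\(\s*(\S+)\s*\|.*?\)\s*=\s*(?:\[\w+\]\s*)?(\S+)\s*\[\s*(\S+)\s*\]"
)


def target_log10_prob(block: str, target: str) -> Optional[float]:
    """Return the log10 probability of the last line predicting `target`."""
    result = None
    for line in block.splitlines():
        match = LINE_RE.search(line)
        if match and match.group(1) == target:
            # Keep the last match, since the target is the final word of the line.
            result = float(match.group(3))
    return result


sample = """\
bob paints a lot statistics
     p( lot | a ...) =  0.02713076 [ -1.566538 ]
     p( statistics | lot ...)    =  1.85185e-09 [ -8.732394 ]
"""
print(target_log10_prob(sample, "statistics"))
```
</pre>

<p class="MsoNormal"><span lang="EN-US">The bracketed value is the base-10 logarithm of the probability printed before it, which is why the sketch reads the bracketed number rather than the raw probability.<o:p></o:p></span></p>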


<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">I would then collect the probabilities of every word given that context and voilà, there is the WPD. However, doing this for a huge test.txt file and a huge vocabulary would take months to compute! So I was wondering whether there is a nicer way to compute the WPD, which is basically a measurement of the popular ‘surprisal’ concept.<o:p></o:p></span></p>
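<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>

<p class="MsoNormal"><span lang="EN-US">For reference, this is how I would turn one collected value into surprisal, taking surprisal to be the negative base-2 logarithm of the probability and assuming the bracketed numbers in the output are base-10 logarithms (which matches the values shown above). The function name surprisal_bits is just for illustration:<o:p></o:p></span></p>

<pre>
```python
import math


def surprisal_bits(log10_prob: float) -> float:
    """Convert a base-10 log probability into surprisal in bits (-log2 p)."""
    return -log10_prob / math.log10(2.0)


# Bracketed value reported for P(statistics | bob paints a lot) above.
print(round(surprisal_bits(-8.732394), 2))
```
</pre>

<p class="MsoNormal"><span lang="EN-US">This conversion is cheap; the expensive part remains obtaining one probability per vocabulary word and context.<o:p></o:p></span></p>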


<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">Cheers,<o:p></o:p></span></p>


<p class="MsoNormal"><span lang="EN-US">Hanno<o:p></o:p></span></p>


</div>


</body>


</html>