<html>

  <head>

    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">You should use the ngram -counts option
      and feed it only the 5-grams you are interested in.  That way you
      don't have to compute the probabilities of all the words earlier
      in the sentence.</div>
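    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">As a minimal sketch of preparing that
      input, assuming the usual SRILM counts format (the words of the
      N-gram separated by spaces, then a tab and a count): the Python
      snippet below appends each candidate word to one fixed history and
      gives every line a count of 1.  The helper name write_counts, the
      toy vocabulary and the count of 1 are illustrative assumptions,
      not part of SRILM.</div>
    <pre>
```python
import io


def write_counts(history, vocab, out):
    """Write one line per vocabulary word: history + word, a tab, and a count of 1."""
    for word in vocab:
        ngram = " ".join(list(history) + [word])
        out.write(ngram + "\t1\n")


# The 4-word history from the question and a toy 3-word vocabulary (assumed).
history = ["bob", "paints", "a", "lot"]
vocab = ["of", "statistics", "pictures"]

# An in-memory buffer keeps the sketch self-contained; use open(...) for a real file.
buf = io.StringIO()
write_counts(history, vocab, buf)
counts_text = buf.getvalue()
print(counts_text, end="")
```
</pre>
    <div class="moz-cite-prefix">The resulting file would then be passed
      to ngram along the lines of "ngram -lm LM -order 5 -counts FILE",
      where LM and FILE are placeholders for your own model and counts
      files, together with the -debug level you already use for -ppl.</div>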

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">An even more efficient solution is
      available, but only at the API level, not in any of the
      command-line tools.  The function wordProbRecompute() provides an
      efficient way to look up the conditional probabilities of
      multiple words in the same LM context.  You'd have to write some
      C++ code to:<br>
    </div>

    <div class="moz-cite-prefix">1 - read a list of LM histories, and

      for each of them<br>

    </div>

    <div class="moz-cite-prefix">2 - for each word in the vocab, call
      wordProbRecompute() on that history and word.</div>

    <div class="moz-cite-prefix">3 - write out the results.</div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">The function LM::wordProbSum(const

      VocabIndex *context) in lm/src/LM.cc shows how to do step 2.</div>
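    <div class="moz-cite-prefix"><br>
    </div>
    <div class="moz-cite-prefix">The control flow of the three steps can
      be sketched independently of SRILM.  In the Python outline below,
      cond_logprob is a hypothetical stand-in for the LM lookup (in C++
      this is where the SRILM call would go), and a uniform toy model is
      plugged in so that the snippet runs on its own.  Log probabilities
      are assumed to be base 10, as in the ngram debug output.</div>
    <pre>
```python
import math


def word_distribution(history, vocab, cond_logprob):
    """Step 2: probability of every vocabulary word after one fixed history."""
    return {word: 10.0 ** cond_logprob(word, history) for word in vocab}


# Toy vocabulary and a hypothetical stand-in for the LM lookup (uniform model).
vocab = ["of", "statistics", "pictures", "mountains"]


def uniform_logprob(word, history):
    return -math.log10(len(vocab))


# Step 1: the histories to score (read from a file in practice).
histories = [("bob", "paints", "a", "lot"), ("paints", "a", "lot", "of")]

rows = []    # step 3 would write these to a file instead of collecting them
totals = []  # per-history sums, which should be close to 1 for a proper LM
for history in histories:
    dist = word_distribution(history, vocab, uniform_logprob)
    totals.append(sum(dist.values()))
    for word, prob in dist.items():
        rows.append((" ".join(history), word, prob))

# Step 3: one tab-separated line per (history, word) pair.
for context, word, prob in rows:
    print(context + "\t" + word + "\t" + format(prob, ".6g"))
```
</pre>
    <div class="moz-cite-prefix">Summing the probabilities per history,
      as done for totals here, is a cheap sanity check on the output.</div>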

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">Andreas</div>

    <div class="moz-cite-prefix"><br>

    </div>

    <div class="moz-cite-prefix">On 2/5/2020 10:10 AM, Müller, H.M.

      (Hanno) wrote:<br>

    </div>

    <blockquote type="cite"

      cite="mid:5dce5168107f4b5088a221033a2a30d2@EXPRD06.hosting.ru.nl">


      <style><!--

/* Font Definitions */

@font-face

        {font-family:"Cambria Math";

        panose-1:2 4 5 3 5 4 6 3 2 4;}

@font-face

        {font-family:Calibri;

        panose-1:2 15 5 2 2 2 4 3 2 4;}

/* Style Definitions */

p.MsoNormal, li.MsoNormal, div.MsoNormal

        {margin:0cm;

        margin-bottom:.0001pt;

        font-size:11.0pt;

        font-family:"Calibri",sans-serif;

        mso-fareast-language:EN-US;}

a:link, span.MsoHyperlink

        {mso-style-priority:99;

        color:#0563C1;

        text-decoration:underline;}

a:visited, span.MsoHyperlinkFollowed

        {mso-style-priority:99;

        color:#954F72;

        text-decoration:underline;}

span.EmailStyle17

        {mso-style-type:personal-compose;

        font-family:"Calibri",sans-serif;

        color:windowtext;}

.MsoChpDefault

        {mso-style-type:export-only;

        font-family:"Calibri",sans-serif;

        mso-fareast-language:EN-US;}

@page WordSection1

        {size:612.0pt 792.0pt;

        margin:70.85pt 70.85pt 70.85pt 70.85pt;}

div.WordSection1

        {page:WordSection1;}

--></style>

      <div class="WordSection1">

        <p class="MsoNormal"><span lang="EN-US">Hi,<o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">I derived a fifth-order
            LM and a vocabulary from a file input.txt using ngram-count.
            As a second step, I would like to compute a Word Probability
            Distribution (WPD) for all sentences in another file called
            test.txt, i.e. how probable each word from the vocabulary is
            after a given ngram. For instance, imagine that “bob paints a
            lot of pictures depicting mountains” is a sentence in
            test.txt. I can then prepare a file test_sentence1.txt:</span></p>

        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">bob paints a lot word_1<o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">bob paints a lot word_2<o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">…<o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">bob paints a lot word_n<o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">And compute the

            probability of every word_x with

            <o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">ngram -ppl

            test_sentence1.txt -order 5 -debug 2 > ppl_sentence1.txt</span><span

            lang="EN-US"><o:p></o:p></span></p>

        <p class="MsoNormal"><span style="font-family:'Courier New'" lang="EN-US"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">The blocks of the result

            look somewhat like this:<o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">bob paints a lot

            statistics<o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">     p( bob | &lt;s&gt; ) =  0.009426857 [ -2.025633 ]</span></p>
        <p class="MsoNormal"><span lang="EN-US">     p( paints | bob ...)   </span>=  0.04610244 [ -1.336276 ]</p>
        <p class="MsoNormal">     p( a | paints ...)    =  0.04379878 [ -1.358538 ]</p>
        <p class="MsoNormal">     p( lot | a ...) =  0.02713076 [ -1.566538 ]</p>
        <p class="MsoNormal"><span lang="EN-US">     p( statistics | lot ...)    =  1.85185e-09 [ -8.732394 ]    &lt;---- target: P(statistics|bob paints a lot)</span></p>
        <p class="MsoNormal"><span lang="EN-US">     p( &lt;/s&gt; | statistics ...)    =  0.04183223 [ -1.378489 ]</span></p>

        <p class="MsoNormal"><span lang="EN-US">1 sentences, 5 words, 0

            OOVs<o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">0 zeroprobs, logprob=

            -23.32394 ppl= 2147.79 ppl1= 7714.783<o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">I would then collect the
            probabilities of every word given that context and voilà,
            there goes the WPD. However, doing this for a huge test.txt
            file and a huge vocabulary would take months to compute! So
            I was wondering whether there is a nicer way to compute the
            WPD, which is basically a measurement of the popular
            ‘surprisal’ concept.</span></p>

        <p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">Cheers,<o:p></o:p></span></p>

        <p class="MsoNormal"><span lang="EN-US">Hanno<o:p></o:p></span></p>

      </div>


    </blockquote>

    <p><br>

    </p>

  </body>

</html>