<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">You should use the ngram -counts option
and feed it only the 5-grams you are interested in. ngram then
computes the probability of only the last word of each 5-gram, so
you avoid computing the probabilities of all the words earlier in
the sentence.</div>
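<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">As a rough illustration, the counts file
for this approach could be generated with a short script. This is
only a sketch: the function names are hypothetical, and it assumes
each line holds the N-gram followed by a tab and a count of 1 (the
format written by ngram-count), which you should check against your
SRILM version.</div>
<pre>
```python
# Sketch: build one 5-gram per (history, candidate word) pair so that
# `ngram -counts` only has to score the final word of each line.
# Assumption: a count-file line is "w1 w2 w3 w4 w5", a TAB, then the count.


def make_count_lines(histories, vocab):
    """Return one count-file line per (history, word) pair."""
    lines = []
    for history in histories:
        for word in vocab:
            lines.append(history + " " + word + "\t1")
    return lines


def write_counts_file(path, histories, vocab):
    """Write the generated N-grams to `path`, one per line."""
    with open(path, "w", encoding="utf-8") as out:
        for line in make_count_lines(histories, vocab):
            out.write(line + "\n")
```
</pre>
<div class="moz-cite-prefix">The resulting file could then be passed
to something like "ngram -lm model.lm -order 5 -counts counts.txt
-debug 2", where model.lm and counts.txt are placeholder names; see
the ngram man page for what each debug level prints.</div>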
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">An even more efficient solution is
available, but only at the API level and not in any of the
command-line tools. The function wordProbRecompute() provides an
efficient way to look up the conditional probabilities for
multiple words in the same LM context. You'd have to write some
C++ code to:<br>
</div>
<div class="moz-cite-prefix">1 - read a list of LM histories, and
for each of them<br>
</div>
<div class="moz-cite-prefix">2 - for each word in the vocab, call
wordProbRecompute() on that history and word.</div>
<div class="moz-cite-prefix">3 - write out the results.</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">The function LM::wordProbSum(const
VocabIndex *context) in lm/src/LM.cc shows how to do step 2.</div>
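<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">The overall shape of that loop can be
sketched in Python. This is not the SRILM API: cond_logprob is a
hypothetical callable standing in for the C++ lookup, and all the
function names are made up for illustration. It only shows the
three steps (iterate over histories, score every vocabulary word,
collect the results), plus a sum check in the spirit of
wordProbSum.</div>
<pre>
```python
# Sketch of the history-by-vocabulary loop. `cond_logprob(word, history)` is a
# hypothetical stand-in for the LM lookup and must return a base-10 log
# probability; it is NOT an SRILM function.


def word_distribution(cond_logprob, history, vocab):
    """Step 2: log10 probability of every vocabulary word after one history."""
    return {word: cond_logprob(word, history) for word in vocab}


def distributions_for_histories(cond_logprob, histories, vocab):
    """Steps 1 and 3: one distribution per history, keyed by the history tuple."""
    return {
        tuple(history): word_distribution(cond_logprob, history, vocab)
        for history in histories
    }


def prob_sum(distribution):
    """Sanity check: the probabilities for one history should sum to about 1."""
    return sum(10.0 ** logprob for logprob in distribution.values())
```
</pre>
<div class="moz-cite-prefix">If prob_sum() is far from 1 for some
history, the vocabulary passed in probably does not match the one
the LM was trained with.</div>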
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">Andreas</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 2/5/2020 10:10 AM, Müller, H.M.
(Hanno) wrote:<br>
</div>
<blockquote type="cite"
cite="mid:5dce5168107f4b5088a221033a2a30d2@EXPRD06.hosting.ru.nl">
<div class="WordSection1">
<p class="MsoNormal"><span lang="EN-US">Hi,<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">I derived a fifth-order
LM and a vocabulary from a file input.txt using ngram-count.
As a second step, I would like to compute a Word Probability
Distribution (WPD) for all sentences in another file called
test.txt, i.e. how probable each word from the vocabulary is
after a given ngram. For instance, imagine that “bob paints a
lot of pictures depicting mountains” is a sentence in
test.txt. I can then prepare a file test_sentence1.txt:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">bob paints a lot word_1<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">bob paints a lot word_2<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">…<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">bob paints a lot word_n<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">And compute the
probability of every word_x with
<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">ngram -ppl
test_sentence1.txt -order 5 -debug 2 &gt; ppl_sentence1.txt</span><span
lang="EN-US"><o:p></o:p></span></p>
<p class="MsoNormal"><span style="font-family:&quot;Courier
New&quot;" lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">The blocks of the result
look somewhat like this:<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">bob paints a lot
statistics<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"> p( bob | &lt;s&gt;
) = 0.009426857 [ -2.025633 ]<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"> p( paints | bob
...) </span>= 0.04610244 [ -1.336276 ]<o:p></o:p></p>
<p class="MsoNormal"> p( a | paints ...) = 0.04379878 [
-1.358538 ]<o:p></o:p></p>
<p class="MsoNormal"> p( lot | a ...) = 0.02713076 [
-1.566538 ]<o:p></o:p></p>
<p class="MsoNormal"><span lang="EN-US"> p( statistics | lot
...) = 1.85185e-09 [ -8.732394 ] &lt;---- target:
P(statistics|bob paints a lot)<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US"> p( &lt;/s&gt; |
statistics ...) = 0.04183223 [ -1.378489 ]<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">1 sentences, 5 words, 0
OOVs<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">0 zeroprobs, logprob=
-23.32394 ppl= 2147.79 ppl1= 7714.783<o:p></o:p></span></p>
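<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Collecting the target
values from output shaped like the block above could be sketched
as follows. The regular expression and function names are
assumptions based only on the sample shown here, not on a
documented output format, and the sketch assumes the target word
is the entry just before the end-of-sentence token.<o:p></o:p></span></p>
<pre>
```python
import re

# Matches lines such as "p( lot | a ...) = 0.02713076 [ -1.566538 ]".
PROB_LINE = re.compile(
    r"p\(\s*(\S+)\s*\|\s*(.*?)\s*\)\s*=\s*(\S+)\s*\[\s*(\S+)\s*\]"
)


def parse_sentence_blocks(text):
    """Return one list of (word, prob, log10_prob) tuples per sentence."""
    blocks, current = [], []
    for line in text.splitlines():
        match = PROB_LINE.search(line)
        if match:
            word, _context, prob, logprob = match.groups()
            current.append((word, float(prob), float(logprob)))
        elif current and "sentences," in line:
            # The per-sentence summary line closes the current block.
            blocks.append(current)
            current = []
    if current:
        blocks.append(current)
    return blocks


def target_entries(blocks):
    """Pick the entry just before the end-of-sentence token in each block."""
    # block[-2:-1] is empty for blocks with fewer than two entries.
    return [entry for block in blocks for entry in block[-2:-1]]
```
</pre>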
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">I would then collect the
probabilities of every word given that context and voilà,
there goes the WPD. However, doing this for a huge test.txt
file and a huge vocabulary would take months to compute! So I
was wondering whether there is a nicer way to compute the WPD,
which is basically a measurement of the popular ‘surprisal’
concept.<o:p></o:p></span></p>
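<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">For reference, turning one
of these values into a surprisal in bits is a one-line conversion.
The sketch below assumes the bracketed numbers in the -debug 2
output are base-10 log probabilities, which is consistent with the
sample above (log10 of 0.009426857 is about -2.0256); the function
name is made up for illustration.<o:p></o:p></span></p>
<pre>
```python
import math


def surprisal_bits(log10_prob):
    """Convert a base-10 log probability to surprisal in bits, -log2(p)."""
    # log2(p) = log10(p) * log2(10), so negate and rescale.
    return -log10_prob * math.log2(10.0)
```
</pre>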
<p class="MsoNormal"><span lang="EN-US"><o:p> </o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Cheers,<o:p></o:p></span></p>
<p class="MsoNormal"><span lang="EN-US">Hanno<o:p></o:p></span></p>
</div>
<br>
<fieldset class="mimeAttachmentHeader"></fieldset>
<pre class="moz-quote-pre" wrap="">_______________________________________________
SRILM-User site list
<a class="moz-txt-link-abbreviated" href="mailto:SRILM-User@speech.sri.com">SRILM-User@speech.sri.com</a>
<a class="moz-txt-link-freetext" href="http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user">http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user</a></pre>
</blockquote>
<p><br>
</p>
</body>
</html>