[SRILM User List] [EXTERNAL] Compute Word Probability Distribution
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Feb 6 05:53:22 PST 2020
You should use the ngram -counts option and feed it only the 5-grams you
are interested in. This will keep you from having to compute all the
word probabilities earlier in the sentence.
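For example (assuming your model was written to a file called input.lm;
adjust the names to your setup), put one 5-gram per line in a counts file,
with a count of 1 appended:

    bob paints a lot word_1 1
    bob paints a lot word_2 1
    ...
    bob paints a lot word_n 1

and then run

    ngram -lm input.lm -order 5 -counts test_counts.txt -debug 2

which computes probabilities only for the last word of each 5-gram, using
the preceding words as the context.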
An even more efficient solution is available, but only at the API level
and not in any of the command-line tools. The function
wordProbRecompute() provides an efficient way to look up the conditional
probabilities for multiple words in the same LM context. You'd have
to write some C++ code to
1 - read a list of LM histories, and for each of them
2 - for each word in the vocab, call wordProbRecompute() on that history
and word.
3 - write out the results.
The function LM::wordProbSum(const VocabIndex *context) in lm/src/LM.cc
shows how to do step 2.
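A rough, untested sketch (again assuming the model file is called input.lm,
and hard-coding a single history for illustration) might look like this:

    #include <stdio.h>

    #include "File.h"
    #include "Vocab.h"
    #include "Ngram.h"
    #include "Prob.h"

    int main()
    {
        Vocab vocab;
        Ngram lm(vocab, 5);                   // 5th-order model

        File lmFile("input.lm", "r");         // placeholder model file name
        lm.read(lmFile);

        // Step 1, for a single history: "bob paints a lot", stored in
        // reverse order (most recent word first) and terminated by
        // Vocab_None, as wordProb()/wordProbRecompute() expect.
        VocabIndex context[5];
        context[0] = vocab.getIndex("lot");
        context[1] = vocab.getIndex("a");
        context[2] = vocab.getIndex("paints");
        context[3] = vocab.getIndex("bob");
        context[4] = Vocab_None;

        // Steps 2 and 3: loop over the vocabulary and print P(word | history).
        // As in LM::wordProbSum(), use wordProb() for the first word and
        // wordProbRecompute() for the remaining words in the same context.
        VocabIter iter(vocab);
        VocabIndex wid;
        bool first = true;
        while (iter.next(wid)) {
            if (lm.isNonWord(wid)) {
                continue;                     // skip <s> and other non-events
            }
            LogP lp = first ? lm.wordProb(wid, context)
                            : lm.wordProbRecompute(wid, context);
            first = false;
            printf("%s %g\n", vocab.getWord(wid), LogPtoProb(lp));
        }

        return 0;
    }

Reading the histories from a file instead of hard-coding one is then just a
matter of tokenizing each line and filling the context array accordingly.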
Andreas
On 2/5/2020 10:10 AM, Müller, H.M. (Hanno) wrote:
>
> Hi,
>
> I derived a fifth-order LM and a vocabulary from a file input.txt
> using ngram-count. As a second step, I would like to compute a Word
> Probability Distribution for all sentences in another file called
> test.txt, i.e. how probable each word from the vocabulary is after a
> given ngram. For instance, imagine that “bob paints a lot of pictures
> depicting mountains” is a sentence in test.txt. I can then prepare a
> file test_sentence1.txt:
>
> bob paints a lot word_1
>
> bob paints a lot word_2
>
> …
>
> bob paints a lot word_n
>
> And compute the probability of every word_x with
>
> ngram -ppl test_sentence1.txt -order 5 -debug 2 > ppl_sentence1.txt
>
> The blocks of the result look somewhat like this:
>
> bob paints a lot statistics
>
> p( bob | <s> ) = 0.009426857 [ -2.025633 ]
>
> p( paints | bob ...) = 0.04610244 [ -1.336276 ]
>
> p( a | paints ...) = 0.04379878 [ -1.358538 ]
>
> p( lot | a ...) = 0.02713076 [ -1.566538 ]
>
> p( statistics | lot ...) = 1.85185e-09 [ -8.732394 ] <----
> target: P(statistics|bob paints a lot)
>
> p( </s> | statistics ...) = 0.04183223 [ -1.378489 ]
>
> 1 sentences, 5 words, 0 OOVs
>
> 0 zeroprobs, logprob= -23.32394 ppl= 2147.79 ppl1= 7714.783
>
> I would then collect the probabilities of every word given that
> context and voilà, there goes the WPD. However, doing this for a huge
> test.txt file and a huge vocabulary would take months to compute! So I
> was wondering whether there is a nicer way to compute
> the WPD, which is basically a measurement of the popular ‘surprisal’
> concept.
>
> Cheers,
>
> Hanno
>
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user