Compute Word Probability Distribution

Müller, H.M. (Hanno) H.Muller at let.ru.nl
Wed Feb 5 10:10:20 PST 2020


I derived a fifth-order LM and a vocabulary from a file input.txt using ngram-count. As a second step, I would like to compute a Word Probability Distribution (WPD) for all sentences in another file called test.txt, i.e. how probable each word from the vocabulary is after a given n-gram context. For instance, imagine that “bob paints a lot of pictures depicting mountains” is a sentence in test.txt. I can then prepare a file test_sentence1.txt:

bob paints a lot word_1
bob paints a lot word_2
bob paints a lot word_n
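
Such a file could be generated with a few lines of Python. This is only a sketch: the function name make_candidate_lines and the default set of skipped tokens are assumptions for illustration, and the vocabulary is assumed to be available as a list of words (for example, read one word per line from the vocabulary file).

```python
def make_candidate_lines(context, vocab, skip=("<s>", "</s>")):
    """Return one line per candidate word: the fixed context followed by that word.

    `skip` lists vocabulary entries that should not be used as candidates
    (assumed here to be the sentence-boundary tags).
    """
    return [f"{context} {word}" for word in vocab if word not in skip]


if __name__ == "__main__":
    # Toy vocabulary; in practice this would be read from the vocabulary file.
    vocab = ["pictures", "statistics", "<s>", "</s>"]
    with open("test_sentence1.txt", "w", encoding="utf-8") as handle:
        for line in make_candidate_lines("bob paints a lot", vocab):
            handle.write(line + "\n")
```

This writes one line per vocabulary word, which is exactly why the approach scales so badly: the file grows with the vocabulary size for every single context.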

And compute the probability of every word_x with

ngram -ppl test_sentence1.txt -order 5 -debug 2 > ppl_sentence1.txt

The blocks of the result look somewhat like this:

bob paints a lot statistics
     p( bob | <s> ) =  0.009426857 [ -2.025633 ]
     p( paints | bob ...)   =  0.04610244 [ -1.336276 ]
     p( a | paints ...)    =  0.04379878 [ -1.358538 ]
     p( lot | a ...) =  0.02713076 [ -1.566538 ]
     p( statistics | lot ...)    =  1.85185e-09 [ -8.732394 ]    <---- target: P(statistics|bob paints a lot)
     p( </s> | statistics ...)    =  0.04183223 [ -1.378489 ]
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -23.32394 ppl= 2147.79 ppl1= 7714.783
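
Collecting the target probabilities from such output can be sketched in Python as follows. The regular expression is based only on the line layout shown above, and the function name collect_target_probs is hypothetical. The sketch assumes that the target of each block is the last p( ... ) line whose predicted word is not </s>, and that the "N sentences, ..." line closes a block.

```python
import re

# Matches lines such as:  p( statistics | lot ...)    =  1.85185e-09 [ -8.732394 ]
# Any text after the closing bracket is ignored.
P_LINE = re.compile(
    r"^\s*p\(\s*(\S+)\s*\|\s*(.*?)\s*\)\s*=\s*(\S+)\s*\[\s*(\S+)\s*\]"
)


def collect_target_probs(debug_output):
    """Map each candidate word to (probability, log10 probability)."""
    wpd = {}
    last = None
    for line in debug_output.splitlines():
        match = P_LINE.match(line)
        if match:
            word, _context, prob, logprob = match.groups()
            if word != "</s>":
                # Remember the most recent non-</s> prediction in this block.
                last = (word, float(prob), float(logprob))
        elif "sentences," in line and last is not None:
            # The per-sentence summary line closes the block.
            word, prob, logprob = last
            wpd[word] = (prob, logprob)
            last = None
    return wpd


if __name__ == "__main__":
    with open("ppl_sentence1.txt", encoding="utf-8") as handle:
        for word, (prob, logprob) in collect_target_probs(handle.read()).items():
            print(word, prob, logprob)
```

Lines that do not follow the layout above are simply skipped, so the sketch would need adjusting if the real output differs from the excerpt.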

I would then collect the probability of every word given that context and voilà, there goes the WPD. However, doing this for a huge test.txt file and a huge vocabulary would take months to compute! So I was wondering whether there is a nicer way to compute the WPD, which is basically a measure of the popular ‘surprisal’ concept.
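
For reference, the surprisal of a word is the negative log probability of that word given its context. The bracketed values in the output above are base-10 log probabilities (for example, 10^-2.025633 is about 0.0094), so converting them to bits only needs a change of base. A minimal sketch, with surprisal_bits as a hypothetical helper name:

```python
import math


def surprisal_bits(log10_prob):
    """Convert a base-10 log probability to surprisal in bits.

    surprisal = -log2(p) = -log10(p) * log2(10)
    """
    return -log10_prob * math.log2(10)


if __name__ == "__main__":
    # Bracketed value of the target line in the example output.
    print(surprisal_bits(-8.732394))
```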

