[SRILM User List] Compute Word Probability Distribution
Müller, H.M. (Hanno)
H.Muller at let.ru.nl
Wed Feb 5 10:10:20 PST 2020
Hi,
I derived a fifth-order LM and a vocabulary from a file input.txt using ngram-count. As a second step, I would like to compute a word probability distribution (WPD) for all sentences in another file called test.txt, i.e., how probable each word from the vocabulary is after a given ngram. For instance, imagine that “bob paints a lot of pictures depicting mountains” is a sentence in test.txt. I can then prepare a file test_sentence1.txt:
bob paints a lot word_1
bob paints a lot word_2
…
bob paints a lot word_n
And compute the probability of every word_x with
ngram -ppl test_sentence1.txt -order 5 -debug 2 > ppl_sentence1.txt
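Preparing those candidate files by hand is tedious, so for what it's worth, here is a minimal Python sketch of the generation step (the file names, the context string, and the one-word-per-line vocabulary format are assumptions on my part):

context = "bob paints a lot"  # the four words preceding the candidate slot

with open("vocab.txt") as vocab, open("test_sentence1.txt", "w") as out:
    for line in vocab:
        word = line.strip()
        # skip empty lines and SRILM's sentence-boundary pseudo-words
        if not word or word in ("<s>", "</s>"):
            continue
        out.write(context + " " + word + "\n")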
The blocks in ppl_sentence1.txt look something like this:
bob paints a lot statistics
p( bob | <s> ) = 0.009426857 [ -2.025633 ]
p( paints | bob ...) = 0.04610244 [ -1.336276 ]
p( a | paints ...) = 0.04379878 [ -1.358538 ]
p( lot | a ...) = 0.02713076 [ -1.566538 ]
p( statistics | lot ...) = 1.85185e-09 [ -8.732394 ] <---- target: P(statistics|bob paints a lot)
p( </s> | statistics ...) = 0.04183223 [ -1.378489 ]
1 sentences, 5 words, 0 OOVs
0 zeroprobs, logprob= -23.32394 ppl= 2147.79 ppl1= 7714.783
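To collect the target probabilities out of such a dump automatically, something like the following Python sketch should work. The regular expression is keyed to the line format shown above (it also tolerates an optional “[Ngram]” tag before the probability), and the position index 4 reflects this particular five-word setup, so both are assumptions on my part:

import re

# matches per-word lines such as:
#   p( statistics | lot ...) = 1.85185e-09 [ -8.732394 ]
PROB_LINE = re.compile(
    r"p\(\s*(\S+)\s*\|.*=\s*(?:\[\S+\]\s*)?"
    r"([0-9.eE+-]+)\s*\[\s*(-?[0-9.eE+-]+)\s*\]"
)

def word_prob_distribution(ppl_file, target_index=4):
    """Collect P(word | context) for the word at position target_index
    (0-based) in every sentence block of an ngram -debug 2 dump."""
    dist = {}
    position = 0
    with open(ppl_file) as f:
        for line in f:
            m = PROB_LINE.search(line)
            if not m:
                continue
            word, prob = m.group(1), float(m.group(2))
            if word == "</s>":
                position = 0  # end of one sentence block
                continue
            if position == target_index:
                dist[word] = prob  # P(word_x | bob paints a lot)
            position += 1
    return dist

wpd = word_prob_distribution("ppl_sentence1.txt")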
I would then collect the probabilities of every word given that context and voilà, there goes the WPD. However, doing this for a huge test.txt file and a huge vocabulary would take months to compute! So I was wondering whether there is a nicer way to compute the WPD, which is essentially the quantity underlying the popular ‘surprisal’ measure.
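(For reference: SRILM reports log probabilities in base 10, so the surprisal in bits of a candidate word w after a context c is surprisal(w | c) = -log2 P(w | c) = log2(10) * -log10 P(w | c) ≈ 3.3219 * -log10 P(w | c); the target line above, for example, comes out to 3.3219 * 8.732394 ≈ 29.0 bits.)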
Cheers,
Hanno