[SRILM User List] [EXTERNAL] Compute Word Probability Distribution

Andreas Stolcke stolcke at icsi.berkeley.edu
Thu Feb 6 05:53:22 PST 2020


You should use the ngram -counts option and feed it only the 5-grams you 
are interested in.  This will keep you from having to compute all the 
word probabilities earlier in the sentence.
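
For example (a rough sketch; the LM and file names below are just 
placeholders), put the 5-grams you care about in a counts file, one per 
line in ngram-count format (the N-gram, a tab, then a dummy count of 1):

bob paints a lot word_1	1
bob paints a lot word_2	1
...
bob paints a lot word_n	1

and then run something like

ngram -lm input.lm -order 5 -counts test_counts.txt -debug 2

With -debug 2 this should print the conditional probability of the last 
word of each 5-gram given the preceding four words, without scoring the 
rest of the sentence.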

An even more efficient solution is available, but only at the API level 
and not in any of the command-line tools.  The function 
wordProbRecompute() provides an efficient way to look up the conditional 
probabilities of multiple words in the same LM context.  You'd have 
to write some C++ code to
1 - read a list of LM histories,
2 - for each history and each word in the vocabulary, call 
wordProbRecompute() on that history and word, and
3 - write out the results.

The function LM::wordProbSum(const VocabIndex *context) in lm/src/LM.cc 
shows how to do step 2.
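
Below is a rough, untested sketch of what such a program might look like 
(the LM and history file names are placeholders, and the exact signatures 
may differ slightly between SRILM versions; compare with lm/src/LM.cc and 
the headers in lm/src and dstruct/src):

/* wpd.cc -- sketch: for each history read from a file, print
 * P(w | history) for every word in the vocabulary.
 * Compile and link against the SRILM libraries
 * (typically -loolm -ldstruct -lmisc).
 */

#include <stdio.h>

#include "File.h"
#include "Vocab.h"
#include "Ngram.h"
#include "Prob.h"

const unsigned order = 5;
const unsigned maxWords = 1000;

int
main()
{
    Vocab vocab;
    Ngram lm(vocab, order);

    /* read the LM produced by ngram-count (file name is a placeholder) */
    File lmFile("input.lm", "r");
    lm.read(lmFile);

    /* step 1: read histories, one per line, e.g. "bob paints a lot" */
    File histFile("histories.txt", "r");
    char *line;

    while ((line = histFile.getline())) {
        VocabString words[maxWords + 1];
        unsigned nWords = Vocab::parseWords(line, words, maxWords);

        /* map to indices; contexts are stored most-recent-word-first
         * and terminated by Vocab_None */
        VocabIndex context[maxWords + 1];
        bool oov = false;
        for (unsigned i = 0; i < nWords; i++) {
            context[i] = vocab.getIndex(words[i]);
            if (context[i] == Vocab_None) oov = true;
        }
        context[nWords] = Vocab_None;
        Vocab::reverse(context);

        if (oov) continue;      /* or map unknown history words to <unk> */

        printf("history:");
        for (unsigned i = 0; i < nWords; i++) printf(" %s", words[i]);
        printf("\n");

        /* step 2: loop over the vocabulary; the first lookup goes through
         * wordProb() to set up the cached context, subsequent ones can use
         * wordProbRecompute() (cf. LM::wordProbSum in lm/src/LM.cc) */
        VocabIter iter(vocab);
        VocabIndex wid;
        bool first = true;

        while (iter.next(wid)) {
            if (lm.isNonWord(wid)) continue;

            LogP lprob = first ? lm.wordProb(wid, context)
                               : lm.wordProbRecompute(wid, context);
            first = false;

            /* step 3: write out the result (log10 prob and raw prob) */
            printf("%s\t%g\t%g\n", vocab.getWord(wid),
                   (double)lprob, LogPtoProb(lprob));
        }
    }

    return 0;
}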

Andreas

On 2/5/2020 10:10 AM, Müller, H.M. (Hanno) wrote:
>
> Hi,
>
> I derived a fifth-order LM and a vocabulary from a file input.txt 
> using ngram-count. As a second step, I would like to compute a Word 
> Probability Distribution for all sentences in another file called 
> test.txt, i.e. how probable each word from the vocabulary is after a 
> given ngram. For instance, imagine that “bob paints a lot of pictures 
> depicting mountains” is a sentence in test.txt. I can then prepare a 
> file test_sentence1.txt:
>
> bob paints a lot word_1
>
> bob paints a lot word_2
>
> ...
> bob paints a lot word_n
>
> And compute the probability of every word_x with
>
> ngram -ppl test_sentence1.txt -order 5 -debug 2 > ppl_sentence1.txt
>
> The blocks of the result look somewhat like this:
>
> bob paints a lot statistics
>
>      p( bob | <s> ) =  0.009426857 [ -2.025633 ]
>
>      p( paints | bob ...) =  0.04610244 [ -1.336276 ]
>
>      p( a | paints ...)    =  0.04379878 [ -1.358538 ]
>
>      p( lot | a ...) =  0.02713076 [ -1.566538 ]
>
>      p( statistics | lot ...)    =  1.85185e-09 [ -8.732394 ]    <---- 
> target: P(statistics|bob paints a lot)
>
>      p( </s> | statistics ...)    =  0.04183223 [ -1.378489 ]
>
> 1 sentences, 5 words, 0 OOVs
>
> 0 zeroprobs, logprob= -23.32394 ppl= 2147.79 ppl1= 7714.783
>
> I would then collect the probabilities of every word given that 
> context and voilà, there goes the WPD. However, doing this for a 
> huge test.txt file and a huge vocabulary would take months to 
> compute! So I was wondering whether there is a nicer way to compute 
> the WPD, which is basically a measure of the popular ‘surprisal’ 
> concept.
>
> Cheers,
>
> Hanno
>
>

