[SRILM User List] Generate Probability Distribution

Andreas Stolcke stolcke at icsi.berkeley.edu
Mon Mar 13 10:11:06 PDT 2017


A brute-force solution to this (if you don't want to modify any code)
is to generate an N-gram count file of the form

apple banana banana carrot apple        1
apple banana banana carrot banana        1
apple banana banana carrot carrot        1

and pass it to

     ngram -lm LM -order 5 -counts COUNTS -debug 2
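
For a real vocabulary you would generate those lines with a small
program rather than by hand. A minimal sketch, with the context,
vocabulary, and output file name hard-coded to the example's values:

    #include <cstdio>

    // Write one 5-gram per vocabulary word: the fixed 4-word context,
    // the candidate next word, then a count of 1.
    int main() {
        const char *context = "apple banana banana carrot";
        const char *vocab[] = { "apple", "banana", "carrot" };

        std::FILE *out = std::fopen("COUNTS", "w");
        if (out == NULL) return 1;
        for (const char *w : vocab) {
            std::fprintf(out, "%s %s\t1\n", context, w);
        }
        std::fclose(out);
        return 0;
    }

With -debug 2, ngram then prints the log probability it assigns to
each of those 5-grams, which together form the distribution over the
next word.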

If you want to make a minimal code change to enumerate all conditional 
probabilities for any context encountered, you could do so in 
LM::wordProbSum() and have it dump out the word tokens and their log 
probabilities.  Then process some text with ngram -debug 3.
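
The loop in wordProbSum() already visits every word in the vocabulary,
so the change amounts to one extra output statement. An untested
sketch (verify the names against your copy of lm/src/LM.cc):

    Prob LM::wordProbSum(const VocabIndex *context)
    {
        double total = 0.0;
        VocabIter iter(vocab);
        VocabIndex wid;

        while (iter.next(wid)) {
            if (!isNonWord(wid)) {
                LogP lp = wordProb(wid, context);
                // New: dump each word and its conditional log probability
                cerr << vocab.getWord(wid) << "\t" << lp << endl;
                total += LogPtoProb(lp);
            }
        }
        return total;
    }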

Andreas



On 3/12/2017 12:12 AM, Dávid Nemeskey wrote:
> Hi Kalpesh,
>
> well, there's LM::wordProb(VocabIndex word, const VocabIndex *context) 
> in lm/src/LM.cc (and in lm/src/NgramLM.cc, if you are using an ngram 
> model). You could simply call it on every word in the vocabulary. 
> However, be warned that this will be very slow for any reasonable 
> vocabulary size (say 10k and up). This function is also what 
> generateWord() calls, which is why the latter is so slow.
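>
> A rough sketch of such a loop as a standalone program (untested, so
> check the API names against your SRILM headers; note that SRILM
> contexts are stored most-recent-word-first and end in Vocab_None):
>
>     #include <stdio.h>
>     #include "File.h"
>     #include "Vocab.h"
>     #include "Ngram.h"
>
>     int main() {
>         Vocab vocab;
>         Ngram lm(vocab, 5);          // order-5 model, as in the example
>
>         File lmFile("LM", "r");      // "LM" is a placeholder path
>         lm.read(lmFile);
>
>         // "apple banana banana carrot", reversed for SRILM:
>         VocabIndex context[5];
>         context[0] = vocab.getIndex("carrot");
>         context[1] = vocab.getIndex("banana");
>         context[2] = vocab.getIndex("banana");
>         context[3] = vocab.getIndex("apple");
>         context[4] = Vocab_None;
>
>         VocabIndex wid;
>         VocabIter iter(vocab);
>         while (iter.next(wid)) {
>             printf("%s %f\n", vocab.getWord(wid),
>                    lm.wordProb(wid, context));
>         }
>         return 0;
>     }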
>
> If you just wanted the top n most probable words, the situation would 
> be a bit different. Then wordProb() wouldn't be the optimal solution, 
> because the trie built by ngram is reversed (meaning you have to go 
> back from the word to the root, and not the other way around), and you 
> would have to query all words to get the most probable one. So when I 
> wanted to do this, I built another trie (from the root up to the word), 
> which made it much faster, though I am not sure it was 100% correct in 
> the face of negative backoff weights (a toy sketch of the idea is 
> below). But it wouldn't help in your case, I guess.
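>
> A toy illustration of that forward structure (not SRILM code; all
> names here are made up): map each context to its word list, sorted
> once by descending log probability, so a top-n query is one lookup:
>
>     #include <algorithm>
>     #include <functional>
>     #include <map>
>     #include <string>
>     #include <utility>
>     #include <vector>
>
>     // (log probability, word), sorted best-first at build time
>     typedef std::vector<std::pair<float, std::string> > TopList;
>     std::map<std::string, TopList> forwardIndex;
>
>     void addContext(const std::string &context, TopList words) {
>         std::sort(words.begin(), words.end(),
>                   std::greater<std::pair<float, std::string> >());
>         forwardIndex[context] = words;
>     }
>
>     TopList topN(const std::string &context, size_t n) {
>         std::map<std::string, TopList>::const_iterator it =
>             forwardIndex.find(context);
>         if (it == forwardIndex.end()) return TopList();
>         const TopList &all = it->second;
>         return TopList(all.begin(),
>                        all.begin() + std::min(n, all.size()));
>     }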
>
> Best,
> Dávid
>
> On Sat, Mar 11, 2017 at 8:32 PM, Kalpesh Krishna 
> <kalpeshk2011 at gmail.com> wrote:
>
>     Hello,
>     I have a context of words and I've built an N-gram language model
>     using ./ngram-count. I wish to generate a probability distribution
>     (over the entire vocabulary of words) of the next word. I can't
>     seem to find a good way to do this with ./ngram.
>     What's the best way to do this?
>     For example, if my vocabulary has words "apple, banana, carrot",
>     and my context is "apple banana banana carrot", I want a
>     distribution like - {"apple": 0.25, "banana": 0.5, "carrot": 0.25}.
>
>     Thank you,
>     Kalpesh Krishna
>     http://martiansideofthemoon.github.io/
>

