[SRILM User List] Generate Probability Distribution

Kalpesh Krishna kalpeshk2011 at gmail.com
Tue Mar 14 22:48:33 PDT 2017

Thank you Andreas! This approach is getting me the probabilities really
quickly (within 0.5 seconds, including the pre- and post-processing steps
in a Python wrapper, on a single core). It was very satisfying to see
`np.sum(distribution)` returning values like `0.99999994929300007`.
Thank you for your help Dávid! I'd love to have a look at your code. Here
is my GitHub handle - martiansideofthemoon
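
For anyone who wants to repeat the sanity check above, it amounts to
something like the following. This is only a sketch with made-up numbers,
not the wrapper mentioned above; it assumes one log10 probability has
already been collected per vocabulary word (SRILM reports log probabilities
in base 10), and the function names are placeholders.

```python
import math


def to_distribution(log10_probs):
    """Convert {word: log10 probability} into {word: probability}."""
    return {word: 10.0 ** lp for word, lp in log10_probs.items()}


def is_normalized(distribution, tol=1e-6):
    """Check that the probabilities sum to 1 within a small tolerance."""
    total = math.fsum(distribution.values())
    return math.isclose(total, 1.0, abs_tol=tol)


# Made-up log10 values matching the apple/banana/carrot example.
log10_probs = {
    "apple": math.log10(0.25),
    "banana": math.log10(0.5),
    "carrot": math.log10(0.25),
}
distribution = to_distribution(log10_probs)
print(is_normalized(distribution))
```

A tolerance is used because the sum will rarely be exactly 1.0 once the
values have been printed, parsed and exponentiated.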

With Regards,
Kalpesh Krishna

On Mon, Mar 13, 2017 at 10:41 PM, Andreas Stolcke <stolcke at icsi.berkeley.edu> wrote:

> A brute force solution to this (if you don't want to modify any code) is
> to generate an N-gram count file of the form
>
>     apple banana banana carrot apple        1
>     apple banana banana carrot banana       1
>     apple banana banana carrot carrot       1
>
> and pass it to
>
>     ngram -lm LM -order 5 -counts COUNTS -debug 2
>
> If you want to make a minimal code change to enumerate all conditional
> probabilities for any context encountered, you could do so in
> LM::wordProbSum() and have it dump out the word tokens and their log
> probabilities.  Then process some text with ngram -debug 3.
>
> Andreas
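
The count file described above is easy to generate from a script. Here is
a minimal Python sketch. The function names and file paths are placeholders
of my own; only the ngram flags come from the command quoted above. It
assumes the "n-gram<TAB>count" layout that ngram-count writes.

```python
def make_count_lines(context, vocab, order=5):
    """Return one count-file line per candidate next word."""
    # An order-N model only looks at the last N-1 context words.
    history = list(context)[-(order - 1):] if order > 1 else []
    prefix = " ".join(history)
    # lstrip() drops the leading space when the history is empty.
    return [f"{prefix} {word}\t1".lstrip() for word in vocab]


def write_count_file(path, context, vocab, order=5):
    """Write the count lines to `path`, one per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for line in make_count_lines(context, vocab, order):
            handle.write(line + "\n")


def ngram_command(lm_path, counts_path, order=5):
    """Build the argument list for the ngram call quoted above."""
    return ["ngram", "-lm", lm_path, "-order", str(order),
            "-counts", counts_path, "-debug", "2"]


context = ["apple", "banana", "banana", "carrot"]
vocab = ["apple", "banana", "carrot"]
lines = make_count_lines(context, vocab)
print("\n".join(lines))
print(" ".join(ngram_command("LM", "COUNTS")))
```

The argument list can be handed to subprocess.run(); parsing the -debug 2
output is left out here because it depends on the exact output format.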
> On 3/12/2017 12:12 AM, Dávid Nemeskey wrote:
> Hi Kalpesh,
> well, there's LM::wordProb(VocabIndex word, const VocabIndex *context) in
> lm/src/LM.cc (and in lm/src/NgramLM.cc, if you are using an ngram model).
> You could simply call it on every word in the vocabulary. However, be
> warned that this will be very slow for any reasonable vocabulary size (say
> 10k and up). This function is also what generateWord() calls, that is why
> the latter is so slow.
> If you just wanted the top n most probable words, the situation would be a
> bit different. Then wordProb() wouldn't be the optimal solution because the
> trie built by ngram is reversed (meaning you have to go back from the word
> to the root, and not the other way around), and you would have to query
> all words to get the most probable one. So when I wanted to do this, I built another
> trie (from the root up to the word), which made it much faster, though I am
> not sure it was 100% correct in the face of negative backoff weights. But
> it wouldn't help in your case, I guess.
> Best,
> Dávid
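
To make the "call it on every word in the vocabulary" idea concrete, here
is a toy Python sketch of a backoff lookup plus the loop over the
vocabulary. It is not SRILM code: the tables, the numbers and the function
names are invented for illustration, and it uses plain probabilities
instead of log10 values.

```python
# Toy bigram backoff model. Keys are word tuples, values are probabilities.
PROBS = {
    ("apple",): 0.25,
    ("banana",): 0.5,
    ("carrot",): 0.25,
    ("carrot", "apple"): 0.6,  # the only explicit bigram
}
# Backoff weight chosen so that p(. | carrot) sums to 1:
# leftover mass divided by the unigram mass of the words not seen
# after "carrot".
BOWS = {("carrot",): (1.0 - 0.6) / (1.0 - 0.25)}


def word_prob(word, context):
    """Use the longest matching n-gram, otherwise back off."""
    context = tuple(context)
    if context + (word,) in PROBS:
        return PROBS[context + (word,)]
    if not context:
        return 0.0  # unknown word in this toy model
    # Drop the oldest context word and apply the backoff weight.
    return BOWS.get(context, 1.0) * word_prob(word, context[1:])


def next_word_distribution(context, vocab):
    """One lookup per vocabulary word, so the cost grows with |V|."""
    return {word: word_prob(word, context) for word in vocab}


def top_n(distribution, n):
    """Return the n most probable (word, probability) pairs."""
    ranked = sorted(distribution.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[:n]


vocab = ["apple", "banana", "carrot"]
dist = next_word_distribution(["banana", "carrot"], vocab)
print(dist)
print(top_n(dist, 1))
```

Even for top-n, this version still scores every word before sorting, which
is the cost the forward trie described above is meant to avoid.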
> On Sat, Mar 11, 2017 at 8:32 PM, Kalpesh Krishna <kalpeshk2011 at gmail.com>
> wrote:
>> Hello,
>> I have a context of words and I've built an N-gram language model using
>> ./ngram-count. I wish to generate a probability distribution (over the
>> entire vocabulary of words) of the next word. I can't seem to be able to
>> find a good way to do this with ./ngram.
>> What's the best way to do this?
>> For example, if my vocabulary has words "apple, banana, carrot", and my
>> context is "apple banana banana carrot", I want a distribution like -
>> {"apple": 0.25, "banana": 0.5, "carrot": 0.25}.
>> Thank you,
>> Kalpesh Krishna
>> http://martiansideofthemoon.github.io/
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user

Kalpesh Krishna,
Junior Undergraduate,
Electrical Engineering,
IIT Bombay