[SRILM User List] Generate Probability Distribution

Dávid Nemeskey nemeskeyd at gmail.com
Tue Mar 14 01:58:12 PDT 2017


Hi Kalpesh,

I could send you a binary, but as I mentioned above, it is only PAC (not
in the machine learning sense). So there would be some work involved first:
- sort the words in my trie by frequency, not alphanumerically;
- always check the lower trie node, especially if the backoff weight is > 0.

These changes shouldn't take much time, and they would cut the cost
tremendously: if you want the top k words, O(nk) instead of O(Vk). So I
think it makes more sense to send you the code; however, I based it on an
older version of SRILM, so if you are using the latest one, it might not be
simple to port just by looking at my version. If you have a GitHub account,
though, I could give you access to my private repo, and then you would see
exactly what I changed.
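
To give you the flavor of it, here is a toy version of the forward-trie
lookup. The node layout is made up for illustration; it is not the actual
code in my repo:

    #include <vector>

    // Toy forward trie: the path from the root spells out the context, and
    // each node's children are kept sorted by descending log probability
    // (the first fix on the list above).
    struct FwdNode {
        VocabIndex word;                    // SRILM vocabulary index
        LogP logprob;                       // log10 P(word | context at parent)
        std::vector<FwdNode *> children;    // sorted by logprob, descending
    };

    // With sorted children, the k most probable continuations of a context
    // node are simply its first k children. What this toy version skips is
    // merging in the backoff distribution of the lower-order node, which is
    // exactly the second fix on the list above.
    std::vector<const FwdNode *> topK(const FwdNode *ctx, unsigned k) {
        std::vector<const FwdNode *> out;
        for (size_t i = 0; i < ctx->children.size() && i < k; ++i)
            out.push_back(ctx->children[i]);
        return out;
    }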

Best,
Dávid

On Sun, Mar 12, 2017 at 12:10 PM, Kalpesh Krishna <kalpeshk2011 at gmail.com>
wrote:

> Hi Dávid,
> Thank you for your response. Are there any existing binaries which will
> help me do this quickly? I don't mind a non-SRILM ARPA file reader either.
> Yes, top N words might be good enough in my use case, especially when they
> cover more than 99% of the probability mass. I like the idea of building a
> trie to do this.
>
> Thank you,
> Kalpesh
>
> On 12 Mar 2017 1:42 p.m., "Dávid Nemeskey" <nemeskeyd at gmail.com> wrote:
>
> Hi Kalpesh,
>
> well, there's LM::wordProb(VocabIndex word, const VocabIndex *context) in
> lm/src/LM.cc (and in lm/src/NgramLM.cc, if you are using an ngram model).
> You could simply call it on every word in the vocabulary. However, be
> warned that this will be very slow for any reasonable vocabulary size (say,
> 10k words and up). This function is also what generateWord() calls, which
> is why the latter is so slow.
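>
> For what it's worth, the loop itself is short. A rough sketch, untested
> and written against the SRILM version I have (the model path, the order,
> and the context words below are placeholders):
>
>     #include <cstdio>
>     #include "File.h"
>     #include "Ngram.h"
>     #include "Prob.h"
>     #include "Vocab.h"
>
>     int main() {
>         Vocab vocab;
>         Ngram lm(vocab, 3);                      // trigram model
>         File file("model.lm", "r");              // ARPA file from ngram-count
>         lm.read(file);
>
>         // SRILM contexts are given in *reverse* order: most recent word
>         // first, terminated by Vocab_None.
>         VocabIndex context[3];
>         context[0] = vocab.getIndex("carrot");   // assumed to be in the vocab
>         context[1] = vocab.getIndex("banana");
>         context[2] = Vocab_None;
>
>         VocabIter iter(vocab);
>         VocabIndex wid;
>         VocabString word;
>         while ((word = iter.next(wid)) != 0) {
>             LogP lp = lm.wordProb(wid, context); // log10 P(word | context)
>             printf("%s\t%g\n", word, LogPtoProb(lp));
>         }
>         return 0;
>     }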
>
> If you just wanted the top n most probable words, the situation would be a
> bit different. Then wordProb() wouldn't be the optimal solution, because
> the trie built by ngram is reversed (meaning you have to go from the word
> back to the root, and not the other way around), and you would have to
> query all words to find the most probable ones. So when I wanted to do
> this, I built another trie (from the root up to the word), which made it
> much faster, though I am not sure it was 100% correct in the face of
> negative backoff weights. But it wouldn't help in your case, I guess.
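>
> (Even with the brute-force loop, if all you need is the top n, a bounded
> min-heap keeps the scan at O(V log n). Same caveats as the snippet above,
> whose variables it reuses:)
>
>     #include <functional>
>     #include <queue>
>     #include <utility>
>     #include <vector>
>
>     // Keep the n most probable (logprob, word) pairs; the heap top is the
>     // current minimum, so any better candidate pushes it out.
>     typedef std::pair<LogP, VocabIndex> Cand;
>     std::priority_queue<Cand, std::vector<Cand>, std::greater<Cand> > heap;
>     const unsigned n = 10;
>
>     VocabIter it(vocab);
>     while ((word = it.next(wid)) != 0) {
>         heap.push(Cand(lm.wordProb(wid, context), wid));
>         if (heap.size() > n) heap.pop();
>     }
>     // heap now pops the top n in ascending order of probability.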
>
> Best,
> Dávid
>
> On Sat, Mar 11, 2017 at 8:32 PM, Kalpesh Krishna <kalpeshk2011 at gmail.com>
> wrote:
>
>> Hello,
>> I have a context of words and I've built an N-gram language model using
>> ./ngram-count. I wish to generate a probability distribution (over the
>> entire vocabulary) for the next word. I can't seem to find a good way to
>> do this with ./ngram.
>> What's the best way to do this?
>> For example, if my vocabulary has words "apple, banana, carrot", and my
>> context is "apple banana banana carrot", I want a distribution like -
>> {"apple": 0.25, "banana": 0.5, "carrot": 0.25}.
>>
>> Thank you,
>> Kalpesh Krishna
>> http://martiansideofthemoon.github.io/
>>