<div dir="ltr"><div><div><div><div><div><div>Hi Kalpesh,<br><br></div>I could send you a binary, but that, as I mentioned above, is only PAC (not in the machine learning sense). So there would be some work involved before<br></div>- sort the words in my trie by frequency, not alphanumerically<br></div>- always check the lower trie node, esp. if the backoff weight is > 0.<br><br></div>These changes shouldn't take much time, and they would cut the cost tremendously (if you want the top k words, then O(nk) instead of O(Vk)). So I think it made more sense to send you the code, but I based it on an older version of SRILM, so if you are using the latest one, it might not be so simple to port just by looking at my version. If you have a GitHub user account, though, I could give access to my private repo, and then you would see exactly what I changed.<br><br></div>Best,<br></div>Dávid<br></div><div class="gmail_extra"><br><div class="gmail_quote">On Sun, Mar 12, 2017 at 12:10 PM, Kalpesh Krishna <span dir="ltr"><<a href="mailto:kalpeshk2011@gmail.com" target="_blank">kalpeshk2011@gmail.com</a>></span> wrote:<br><blockquote class="gmail_quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="auto"><div>Hi <span style="font-family:sans-serif">Dávid,</span></div><div dir="auto"><font face="sans-serif">Thank you for your response. Are there any existing binaries which will help me do this quickly? I don't mind a non-SRILM ARPA file reader either.</font></div><div dir="auto"><font face="sans-serif">Yes, top N words might be good enough in my use case, especially when they cover more than 99% of the probability mass. I like the idea of building a trie to do this.</font></div><div dir="auto"><font face="sans-serif"><br></font></div><div dir="auto"><font face="sans-serif">Thank you,</font></div><div dir="auto"><font face="sans-serif">Kalpesh<br></font><div><div class="h5"><div class="gmail_extra" dir="auto"><br><div class="gmail_quote">On 12 Mar 2017 1:42 p.m., "Dávid Nemeskey" <<a href="mailto:nemeskeyd@gmail.com" target="_blank">nemeskeyd@gmail.com</a>> wrote:<br type="attribution"><blockquote class="m_2765005555023026095quote" style="margin:0 0 0 .8ex;border-left:1px #ccc solid;padding-left:1ex"><div dir="ltr"><div><div>Hi Kalpesh,<br><br></div>well, there's LM::<span class="m_2765005555023026095m_4034246613824824293gmail-pl-en">wordProb</span>(VocabIndex word, <span class="m_2765005555023026095m_4034246613824824293gmail-pl-k">const</span> VocabIndex *context<span class="m_2765005555023026095m_4034246613824824293gmail-pl-k"></span><span class="m_2765005555023026095m_4034246613824824293gmail-pl-k"></span>) in lm/src/LM.cc (and in lm/src/NgramLM.cc, if you are using an ngram model). You could simply call it on every word in the vocabulary. However, be warned that this will be very slow for any reasonable vocabulary size (say 10k and up). This function is also what generateWord() calls, that is why the latter is so slow.<br><br>If you just wanted the top n most probable words, the situation would be a bit different. Then wordProb() wouldn't be the optimal solution because the trie built by ngram is reversed (meaning you have to go back from the word to the root, and not the other way around), and you had to query all words to get the most probably one. So when I wanted to do this, I built another trie (from the root up to the word), which made it much faster, though I am not sure it was 100% correct in the face of negative backoff weights. 
Best,
Dávid

On Sat, Mar 11, 2017 at 8:32 PM, Kalpesh Krishna <kalpeshk2011@gmail.com> wrote:

Hello,
I have a context of words and I've built an N-gram language model using ./ngram-count. I wish to generate a probability distribution (over the entire vocabulary of words) of the next word. I can't seem to find a good way to do this with ./ngram.
What's the best way to do this?
For example, if my vocabulary has the words "apple, banana, carrot", and my context is "apple banana banana carrot", I want a distribution like {"apple": 0.25, "banana": 0.5, "carrot": 0.25}.

Thank you,
Kalpesh Krishna
http://martiansideofthemoon.github.io/
_______________________________________________
SRILM-User site list
SRILM-User@speech.sri.com
http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user