Querying count-based LM for specific n-gram probabilities

Andrew Goldberg goldberg at cs.wisc.edu
Sat Apr 5 21:15:09 PDT 2008


Dear list,

I am using the Google 1T n-gram corpus and have successfully built a
count-based LM per the instructions in the FAQ. Thanks for those
tips to get started! I have also been able to compute perplexities
for test sentences using the -ppl option of the ngram program, and
got this working with the newer server options, too! Very cool.

However, what I really want is to retrieve the probabilities of
particular n-grams so I can use them in another application. In other
words, given a word and a history (say, words h1 h2 h3 h4), I would
like to know the LM's probability P( word | h1 h2 h3 h4 ), after
taking interpolation, etc. into account. I know one hack-ish way to
do this would be to put "h1 h2 h3 h4 w" in a test file and then parse
the debug output to get the desired probability. This is complicated
for higher-order n-grams, since the output truncates the histories
with "..."; besides, parsing the output just seems really messy.
Since I'm using the Google corpus with a count-based model, I don't
think it's possible/feasible to write the model's probabilities to
disk, but maybe there's a way around this using -limit-vocab.
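For concreteness, here is a rough Python sketch of that parsing hack. It assumes the -debug 2 output of ngram contains per-word lines of the shape "p( w | h ... ) = [Ngram] prob [ logprob ]"; the exact format may vary across SRILM versions, and the ngram options passed in (e.g. -lm, -order) are placeholders for whatever your setup uses:

```python
import re
import subprocess
import tempfile

# Regex for a -debug 2 probability line, assumed to look like:
#   "\tp( is | my ... ) \t= [3gram] 0.0123 [ -1.91 ]"
PROB_LINE = re.compile(
    r"p\(\s*(\S+)\s*\|.*?\)\s*=\s*\[\S+\]\s*([0-9.eE+-]+)"
)

def parse_word_probs(debug_output):
    """Return (word, probability) pairs found in ngram -debug 2 output."""
    return [(m.group(1), float(m.group(2)))
            for m in PROB_LINE.finditer(debug_output)]

def query_prob(lm_args, history, word):
    """Write 'h1 ... hN w' to a temp file, run ngram -ppl on it, and
    return P(word | history) parsed from the debug output.

    lm_args: extra ngram options for your model,
             e.g. ["-lm", "model.lm", "-order", "5"] (placeholders).
    """
    with tempfile.NamedTemporaryFile("w", suffix=".txt",
                                     delete=False) as f:
        f.write(" ".join(history + [word]) + "\n")
        path = f.name
    out = subprocess.run(
        ["ngram", "-debug", "2", "-ppl", path] + lm_args,
        capture_output=True, text=True).stdout
    # Take the probability line whose predicted word matches ours;
    # note this is fragile if the word repeats in the sentence.
    for w, p in parse_word_probs(out):
        if w == word:
            return p
    return None
```

This is exactly the kind of brittle scraping I would rather avoid, which is why a direct query interface would be much nicer.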

So my question is:
Is there a direct way to query a specific probability using one of
the existing programs (e.g., to find P( is | my name ), specify
options like -word "is" -history "my name")? Or is my only option to
write my own tool for this purpose using the libraries? If so, can
you recommend an existing program that would be a good starting
point? What would be especially great is if I could request n-gram
probabilities as described here via the LM server options (i.e.,
start the server and load the counts for some limited vocab, then
have a client program that can make requests).

Thanks in advance!

- Andrew



More information about the SRILM-User mailing list