Querying count-based LM for specific n-gram probabilities

Andreas Stolcke stolcke at speech.sri.com
Sat Apr 5 21:52:22 PDT 2008


Have a look at the ngram -counts option.

--Andreas

In message <55141B22-A42B-482C-A8B5-D9608AC6CE7E at cs.wisc.edu> you wrote:
> Dear list,
> 
> I am using the Google 1T ngram corpus, and have successfully built a  
> count-based LM as per the instructions on the FAQ. Thanks for those  
> tips to get started! I have also been able to compute perplexities  
> for test sentences using the -ppl option of the ngram program, and  
> got this working with the newer server options, too! Very cool.
> 
> However, what I really want to do is to be able to retrieve just the  
> probabilities for particular n-grams to use them in another  
> application. In other words, given a word and a history (say, words  
> h1 h2 h3 h4), I would like to know the LM's probability P( word | h1  
> h2 h3 h4 ), after taking into account interpolation, etc. I know one  
> hack-ish way to do this would be to put "h1 h2 h3 h4 w" in a test  
> file, and then parse the debug output to get the desired probability.  
> This would be complicated for higher-order ngrams since the output  
> truncates the histories with "..."; plus this idea of parsing the  
> output just seems really messy. Since I'm using the Google corpus  
> with a count-based model, I don't think it's possible/feasible to  
> write the model's probabilities to disk, but maybe there's a way  
> around this using -limit-vocab.
> 
> So my question is:
> Is there a direct way to query for a specific probability using one  
> of the existing programs (i.e., to find P( is | my name), specify  
> some options like -word "is" -history "my name")? Or is my only  
> option to use the libraries to write my own tool for this purpose? If  
> so, can you recommend an existing program that would be a good place  
> to start?  What would be especially great is if I could request ngram  
> probabilities as described here using the LM server options (i.e.,  
> start the server and load the counts for some limited vocab, then  
> have a client program that can make requests).
> 
> Thanks in advance!
> 
> - Andrew
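
As a rough illustration of the -counts suggestion in the reply above: as far as I understand it, ngram -counts reads a file of N-grams with counts and scores the last word of each N-gram given the preceding words, so a query such as P( is | my name ) can be posed as the line "my name is" with a count of 1. The sketch below only prepares such a file. The helper names and the file name are hypothetical, the tab-separated "words<TAB>count" layout is my assumption about the counts format, and the exact ngram invocation (for example, which -debug level prints per-N-gram probabilities) should be checked against the ngram man page.

```python
def format_count_line(history, word, count=1):
    """Return one counts-file line: the N-gram words, a tab, then the count.

    Assumption: the counts file uses space-separated words followed by a
    tab and an integer count; verify against the SRILM documentation.
    """
    tokens = list(history) + [word]
    for tok in tokens:
        # A token containing whitespace would silently change the N-gram order.
        if not tok or any(ch.isspace() for ch in tok):
            raise ValueError(f"token must be non-empty without whitespace: {tok!r}")
    return " ".join(tokens) + "\t" + str(count)


def write_query_file(path, queries):
    """Write (history, word) pairs to `path`, one N-gram per line."""
    with open(path, "w", encoding="utf-8") as out:
        for history, word in queries:
            out.write(format_count_line(history, word) + "\n")
```

For example, write_query_file("queries.counts", [(["my", "name"], "is")]) produces a one-line file that could then be passed to ngram via -counts, together with the same LM options already used for -ppl.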



