Querying count-based LM for specific n-gram probabilities
Andrew Goldberg
goldberg at cs.wisc.edu
Sat Apr 5 21:15:09 PDT 2008
Dear list,
I am using the Google 1T ngram corpus, and have successfully built a
count-based LM as per the instructions on the FAQ. Thanks for those
tips to get started! I have also been able to compute perplexities
for test sentences using the -ppl option of the ngram program, and
got this working with the newer server options, too! Very cool.
However, what I really want to do is to be able to retrieve just the
probabilities for particular n-grams to use them in another
application. In other words, given a word w and a history (say, words
h1 h2 h3 h4), I would like to know the LM's probability P( w | h1
h2 h3 h4 ), after taking into account interpolation, etc. I know one
hack-ish way to do this would be to put "h1 h2 h3 h4 w" in a test
file, and then parse the debug output to get the desired probability.
This would be complicated for higher-order ngrams since the output
truncates the histories with "..."; plus this idea of parsing the
output just seems really messy. Since I'm using the Google corpus
with a count-based model, I don't think it's possible/feasible to
write the model's probabilities to disk, but maybe there's a way
around this using -limit-vocab.
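
To make the parsing workaround concrete, here is a rough Python sketch of
what I have in mind. It assumes that each per-word line of the -ppl debug
output looks like "p( is | name ...) = [3gram] 0.0123 [ -1.91009 ]", where
the last number is the log10 probability; that line shape is my assumption
and should be checked against real output before relying on it. The names
parse_debug_line and last_word_prob are hypothetical helpers, not part of
any existing tool.

```python
import re

# Assumed shape of one per-word line in the -ppl debug output, e.g.
#   "\tp( is | name ...) \t= [3gram] 0.0123 [ -1.91009 ]"
# This pattern is an assumption; adjust it to the output you actually see.
LINE_RE = re.compile(
    r"p\(\s*(?P<word>\S+)\s*\|\s*(?P<context>.*?)\s*\)\s*=\s*"
    r"\[(?P<order>[^\]]+)\]\s*(?P<prob>\S+)\s*\[\s*(?P<logprob>\S+)\s*\]"
)


def parse_debug_line(line):
    """Return the fields of one per-word line, or None if it does not match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    return {
        "word": m.group("word"),
        "context": m.group("context"),  # may be truncated with "..."
        "order": m.group("order"),      # e.g. "3gram"
        "prob": float(m.group("prob")),
        "log10_prob": float(m.group("logprob")),
    }


def last_word_prob(debug_output, word):
    """Return the parsed entry for the last occurrence of `word`, or None."""
    result = None
    for line in debug_output.splitlines():
        parsed = parse_debug_line(line)
        if parsed is not None and parsed["word"] == word:
            result = parsed
    return result
```

With a test file containing only "h1 h2 h3 h4 w", calling
last_word_prob(output, "w") would pick out the one line of interest. The
truncated context is not needed for that, since the full history is already
known from the test file.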
So my question is:
Is there a direct way to query for a specific probability using one
of the existing programs (i.e., to find P( is | my name), specify
some options like -word "is" -history "my name")? Or is my only
option to use the libraries to write my own tool for this purpose? If
so, can you recommend an existing program that would be a good place
to start? What would be especially great is if I could request ngram
probabilities as described here using the LM server options (i.e.,
start the server and load the counts for some limited vocab, then
have a client program that can make requests).
Thanks in advance!
- Andrew
More information about the SRILM-User
mailing list