[SRILM User List] Understanding what the values in .arpa files represent

Wed Aug 14 14:41:19 PDT 2013

I would like to get a good understanding of what the values in .arpa files represent so I can do a better job on a project I am working on. I have found some documentation about .arpa files on the SRILM web site, as well as in some other places, that describes the values in the first column of the "\n-grams" sections of the file as conditional probabilities. 

I assumed from this that if I had an .arpa file containing all of the unigrams and bigrams of a corpus, then [1] for all unigrams, the sum of 10^unigram_value would equal 1.0 and [2] for all bigrams, the sum of (10^bigram_value * 10^unigram_value_of_first_term_in_bigram) would also equal 1.0, since the joint probability p(a, b) = p(b|a) * p(a). It turns out that [1] is true, but for the .arpa file I have been working with, the [2] sum is about 0.68. I was expecting that [2] might sum to something less than 1.0 due to probability mass redistributed for smoothing purposes, but that wouldn't account for 0.32 of the total, would it? 
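
To make the two checks concrete, here is a minimal Python sketch of how sums [1] and [2] could be computed. It assumes the usual ARPA layout (a "\N-grams:" header per section, then one entry per line with the log10 value first, the N words next, and an optional backoff weight last). The function names parse_arpa, unigram_mass, and bigram_joint_mass are hypothetical helpers for illustration and are not part of SRILM.

```python
import re
from collections import defaultdict


def parse_arpa(text):
    """Parse ARPA-format text into {order: {ngram_tuple: log10_value}}.

    Assumes whitespace-separated fields: log10 value, N words,
    then an optional backoff weight (which is ignored here).
    """
    tables = defaultdict(dict)
    order = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        header = re.fullmatch(r"\\(\d+)-grams:", line)
        if header:
            order = int(header.group(1))
            continue
        if line.startswith("\\"):
            # \data\ or \end\ marker: not inside an n-gram section
            order = None
            continue
        if order is None:
            # e.g. the "ngram 1=..." count lines under \data\
            continue
        fields = line.split()
        tables[order][tuple(fields[1:1 + order])] = float(fields[0])
    return tables


def unigram_mass(tables):
    """Check [1]: sum of 10^unigram_value over all listed unigrams."""
    return sum(10 ** lp for lp in tables[1].values())


def bigram_joint_mass(tables):
    """Check [2]: sum of 10^bigram_value * 10^unigram_value_of_first_term,
    taken over the bigrams explicitly listed in the file."""
    return sum(
        10 ** lp * 10 ** tables[1][(w1,)]
        for (w1, _w2), lp in tables[2].items()
    )
```

With the file contents read into a string text, unigram_mass(parse_arpa(text)) corresponds to sum [1] and bigram_joint_mass(parse_arpa(text)) to sum [2]. Note that the second function only visits bigrams that have their own line in the "\2-grams:" section.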

I think it's more likely that I don't understand what the values in the left column represent in the "\n-grams" sections for n >= 2. Is there a way to use the values in an .arpa file to reconstruct joint probabilities for bigrams (and other higher order n-grams) in order to verify that they actually do sum to 1.0 for each "\n-grams" section in the file? 
Thanks, 
Mike 