[SRILM User List] Understanding what the values in .arpa files represent

Thu Aug 15 17:17:41 PDT 2013

On 8/14/2013 2:41 PM, tm-oleary at comcast.net wrote:
> I would like to get a good understanding of what the values in .arpa 
> files represent so I can do a better job on a project I am working on. 
> I have found some documentation about .arpa files on the SRILM web 
> site as well as in some other places that describe the values in the 
> first column of the "\n-grams" sections of the file as conditional 
> probabilities.
>
> I assumed from this that if I had an .arpa file containing all of the 
> unigrams and bigrams of a corpus, that [1] for all unigrams, the sum 
> of 10^unigram_value would equal 1.0 and [2] for all bigrams, the sum 
> of (10^bigram_value * 10^unigram_value_of_first_term_in_bigram) would 
> also equal 1.0, since the joint probability p(a, b) = p(b|a) * p(a). 
> It turns out that [1] is true, but for the .arpa file I have been 
> working with, the [2] sum is about .68. I was expecting that [2] might 
> sum to something less than 1.0 due to probability mass 
> redistributed for smoothing purposes, but that wouldn't account for 
> .32 of the total, would it?
You assume that the LM contains all possible N-grams of a given order 
(in your case, all bigrams).  That is not true.  It only lists the 
N-grams that occur in the training data, and that occur frequently 
enough (subject to the -gtNmin parameters).  The probabilities of 
unlisted N-grams are computed by backoff, using the backoff weights 
stored in the column after the N-gram in the lower-order sections.  For 
an explanation, search for "backoff computation language model".

So if you summed over all possible bigrams, including the unlisted ones 
whose probabilities come from backoff, then you should get the sum = 1 
as you expect.
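
To make that concrete, here is a minimal sketch in Python of the bigram 
backoff computation on a made-up three-word model.  The vocabulary, the 
probabilities, and all function names are hypothetical and chosen only 
for illustration.  In a real .arpa file the backoff weights are read 
from the file rather than computed as done here, and the sentence 
boundary tokens <s> and </s> need extra care that this sketch ignores.

```python
import math

# Toy model with hypothetical values, stored as log10 like an ARPA file.
unigram_logp = {
    "a": math.log10(0.5),
    "b": math.log10(0.3),
    "c": math.log10(0.2),
}
# Only two of the nine possible bigrams are listed explicitly.
bigram_logp = {
    ("a", "b"): math.log10(0.6),
    ("b", "a"): math.log10(0.7),
}


def backoff_weight_log10(history):
    """Backoff weight that makes p(. | history) sum to 1.

    Assumption: a real ARPA file already stores this value next to the
    unigram; it is recomputed here so the toy model is self-consistent.
    """
    listed = {w: lp for (h, w), lp in bigram_logp.items() if h == history}
    left_over = 1.0 - sum(10 ** lp for lp in listed.values())
    unlisted_mass = 1.0 - sum(10 ** unigram_logp[w] for w in listed)
    return math.log10(left_over / unlisted_mass)


unigram_bow = {w: backoff_weight_log10(w) for w in unigram_logp}


def bigram_prob(history, word):
    """p(word | history): listed value if present, otherwise backoff."""
    if (history, word) in bigram_logp:
        return 10 ** bigram_logp[(history, word)]
    return 10 ** (unigram_bow[history] + unigram_logp[word])


def joint_sum_listed_only():
    """Sum of p(a) * p(b|a) over the bigrams listed in the file."""
    return sum(10 ** (lp + unigram_logp[h])
               for (h, _w), lp in bigram_logp.items())


def joint_sum_all():
    """Sum of p(a) * p(b|a) over every possible bigram."""
    return sum(10 ** unigram_logp[h] * bigram_prob(h, w)
               for h in unigram_logp for w in unigram_logp)


if __name__ == "__main__":
    print("listed bigrams only:", joint_sum_listed_only())
    print("all bigrams:        ", joint_sum_all())
```

In this toy model the sum over the listed bigrams alone falls well 
short of 1, while the sum over all bigrams comes out to 1 up to 
floating-point rounding, which is the same kind of gap as in your 
.68 result.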
>
> I think it's more likely that I don't understand what the values in 
> the left column represent in the "\n-grams" sections for n >= 2. Is 
> there a way to use the values in an .arpa file to reconstruct joint 
> probabilities for bigrams (and other higher order n-grams) in order to 
> verify that they actually do sum to 1.0 for each "\n-grams" section in 
> the file?
You are assuming above that the first column contains conditional N-gram 
log probabilities (base 10), and that is correct.

Andreas
