[SRILM User List] Understanding what the values in .arpa files represent
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Aug 15 17:17:41 PDT 2013
On 8/14/2013 2:41 PM, tm-oleary at comcast.net wrote:
> I would like to get a good understanding of what the values in .arpa
> files represent so I can do a better job on a project I am working on.
> I have found some documentation about .arpa files on the SRILM web
> site as well as in some other places that describe the values in the
> first column of the "\n-grams" sections of the file as conditional
> probabilities.
>
> I assumed from this that if I had an .arpa file containing all of the
> unigrams and bigrams of a corpus, that [1] for all unigrams, the sum
> of 10^unigram_value would equal 1.0 and [2] for all bigrams, the sum
> of (10^bigram_value * 10^unigram_value_of_first_term_in_bigram) would
> also equal 1.0, since the joint probability p(a, b) = p(b|a) * p(a).
> It turns out that [1] is true, but for the .arpa file I have been
> working with, the [2] sum is about .68. I was expecting that [2] might
> sum to something less than 1.0 due to probability mass
> redistributed for smoothing purposes, but that wouldn't account for
> .32 of the total, would it?
Your sum in [2] assumes that the LM lists all possible N-grams of a
given order (in your case, all bigrams). That is not true. It lists only
the N-grams that occur in the training data, and that occur frequently
enough (subject to the -gtNmin parameters). The probabilities of
unlisted N-grams are computed by backoff: the backoff weight of the
context (the last column of its lower-order entry) is combined with the
lower-order probability. For an explanation, search for
"backoff computation language model".
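
As a rough illustration of that lookup, here is a minimal Python sketch
for the bigram case. The three dictionaries and the function name are
hypothetical and the numbers are made up; they only mimic the layout of
the "\1-grams:" and "\2-grams:" sections (log10 probability, N-gram,
optional log10 backoff weight). This is not SRILM code.

```python
import math

# Toy values in log10, laid out like the ARPA sections (all numbers made up).
unigram_logprob = {"a": math.log10(0.5), "b": math.log10(0.3), "c": math.log10(0.2)}
# log10 backoff weights from the unigram entries; a missing weight counts as 0.0.
backoff_weight = {"a": math.log10(0.8), "b": 0.0, "c": 0.0}
# Only one bigram is listed; every other bigram must be obtained by backoff.
bigram_logprob = {("a", "a"): math.log10(0.6)}

def bigram_log10(first, second):
    """Return log10 p(second | first), backing off to the unigram if unlisted."""
    if (first, second) in bigram_logprob:
        return bigram_logprob[(first, second)]
    # Unlisted bigram: backoff weight of the context plus the unigram log prob.
    return backoff_weight.get(first, 0.0) + unigram_logprob[second]

# p(b | a) is not listed, so it is 0.8 * 0.3 in the probability domain.
print(10 ** bigram_log10("a", "b"))
```

With these toy numbers the conditional distribution for context "a"
sums to 1 only once the backed-off entries for "b" and "c" are included.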
So if you summed over all possible bigrams, including the unlisted ones
whose probabilities come from backoff, you should get a sum of 1, as
you expect.
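
To make the difference between the two sums concrete, here is a small
Python sketch on a made-up three-word model. It uses plain
probabilities rather than log10 values for readability, and it
recomputes each backoff weight with the usual normalization (left-over
bigram mass divided by the unigram mass of the unlisted words). That
formula and all names and numbers are assumptions for illustration; in
a real .arpa file you would read the weights from the file instead.

```python
from itertools import product

# Hypothetical toy model (plain probabilities, not log10).
unigram = {"a": 0.5, "b": 0.3, "c": 0.2}
listed_bigram = {("a", "a"): 0.6, ("b", "c"): 0.5}

def backoff_weight(first):
    """Weight that makes p(. | first) sum to 1 given the listed bigrams."""
    seen = [w2 for (w1, w2) in listed_bigram if w1 == first]
    left_over = 1.0 - sum(listed_bigram[(first, w2)] for w2 in seen)
    unseen_unigram_mass = 1.0 - sum(unigram[w2] for w2 in seen)
    return left_over / unseen_unigram_mass

def cond_prob(first, second):
    """p(second | first): listed value if present, otherwise backed off."""
    if (first, second) in listed_bigram:
        return listed_bigram[(first, second)]
    return backoff_weight(first) * unigram[second]

# Sum [2] from the question, restricted to the bigrams listed in the model.
listed_sum = sum(p * unigram[w1] for (w1, _), p in listed_bigram.items())

# The same sum taken over every possible bigram in the vocabulary.
full_sum = sum(cond_prob(w1, w2) * unigram[w1]
               for w1, w2 in product(unigram, repeat=2))

print(listed_sum, full_sum)
```

Here the listed-only sum comes to about 0.45, while the sum over all
nine possible bigrams comes to 1 up to floating-point rounding.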
>
> I think it's more likely that I don't understand what the values in
> the left column represent in the "\n-grams" sections for n >= 2. Is
> there a way to use the values in an .arpa file to reconstruct joint
> probabilities for bigrams (and other higher order n-grams) in order to
> verify that they actually do sum to 1.0 for each "\n-grams" section in
> the file?
Your assumption above, that the first column contains conditional
N-gram log probabilities (base 10), is correct.
Andreas