[SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets

Andreas Stolcke stolcke at icsi.berkeley.edu
Wed Oct 26 15:52:33 PDT 2011


Burkay Gur wrote:
> Try
>    echo "this is not"  | ngram -lm LM -ppl - -debug 2
>
>
> ok, this returns a non-zero probability. but i want to now include 
> "this is not" in the language model. and still have all the 
> probabilities in the language model sum up to 1.
>
> in other words i want to expand my language model with multiple 
> tri-grams that are unseen events.
>
> maybe if i tell you the main reason why i want to do this it will be 
> more clear.
>
> i am trying to find the symmetric KL Divergence of two distributions. 
> and these two distributions will be two language models.
>
> the formula for symmetric KL divergence is:
>
> i being all trigrams in both models:
>
> sum[ p(i) * log( p(i) / q(i) ) ]  +  sum[ q(i) * log( q(i) / p(i) ) ]
>
> sums are over all i's.
>
> p(i) is the probability in language model 1. and q(i) is the 
> probability in language model 2.
>
> since we are doing this over all i's, it means we have to include, in 
> each LM, the probabilities of trigrams that occur only in the other 
> LM. otherwise we will get a log(0) error. so we will need some kind 
> of smoothing.
But you don't get log(0), because the LM is smoothed and therefore the 
p's and q's are all > 0.
BTW, you only get a problem from a zero probability in the denominator 
of the log ratio; a zero in the numerator is harmless because 
0 * log(0) = 0.

So you can sum over the UNION of all ngrams in both models, and when you 
need to compute p(i) or q(i) for an ngram that is not in the particular 
model, you use the backoff estimate (i.e., just what SRILM will compute 
when you ask it for the probability of an ngram that is not explicitly 
represented in the model).
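
A minimal sketch of that computation, assuming Python and that the 
per-trigram probabilities have already been collected from each smoothed 
model into dictionaries covering the union (all names here are 
hypothetical):

    import math

    def symmetric_kl(p, q):
        """Symmetric KL divergence between two trigram models.

        p and q map every trigram in the union of both models to its
        smoothed probability.  Smoothing guarantees all values are > 0,
        so log(p[t] / q[t]) is always defined; the 0 * log(0) = 0
        convention would only matter if a numerator could be zero.
        """
        union = set(p) | set(q)
        d_pq = sum(p[t] * math.log(p[t] / q[t]) for t in union)
        d_qp = sum(q[t] * math.log(q[t] / p[t]) for t in union)
        return d_pq + d_qp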

BTW, for this type of thing you want to use ngram -counts, and then 
postprocess the output.
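
For instance, a hedged sketch of that pipeline (it assumes a counts file 
in the usual "ngram<TAB>count" format, and that ngram -debug 2 prints 
one p(...) line per ngram in file order; check your SRILM version's 
output and adjust the parsing if it differs):

    import re
    import subprocess

    def model_probs(lm_file, trigrams):
        """Look up p(w3 | w1 w2) in a smoothed ARPA model per trigram."""
        # Write the union of trigrams as a counts file, count 1 each.
        with open("union.counts", "w") as f:
            for t in trigrams:
                f.write(" ".join(t) + "\t1\n")
        out = subprocess.run(
            ["ngram", "-lm", lm_file, "-counts", "union.counts",
             "-debug", "2"],
            capture_output=True, text=True, check=True).stdout
        # Per-ngram lines are expected to look roughly like:
        #   p( not | this is )  = [2gram] 0.0417 [ -1.38 ]
        probs = [float(m.group(1))
                 for m in re.finditer(r"=\s*\[\S+\]\s+(\S+)\s+\[", out)]
        # Assumes ngram reports the ngrams in file order.
        return dict(zip(trigrams, probs))

Calling this once per model over the same union of trigrams gives the 
two dictionaries that feed straight into the symmetric_kl() sketch above.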

Andreas

>
> say LM1 has these trigrams:
>
> a  1/3
> b  1/3
> c  1/3
>
> and LM2 has these:
>
> a  1/2
> d  1/2
>
> now when we're doing the KL divergence calculation, we need to make 
> sure "d" is in LM1, and also "b" and "c" are in LM2. otherwise we'll 
> get log(0). so we'll need to modify LM1 and LM2 by smoothing, so they 
> include non-zero probabilities for b, c, and d, while each still sums 
> up to 1.
>
> if we use the test-training approach, and try to see the probabilities 
> of unseen events, we are not updating our current LM to include those 
> unseen events. in fact that is what i want to do: include a list of 
> unseen trigrams (whose lower-order n-grams might already be in the 
> model) in that language model.
>
>
> On 10/25/11 10:10 PM, Andreas Stolcke wrote:
>> Burkay Gur wrote:
>>> thank you, i understand that. but the problem is, like you said, how 
>>> do we introduce these "unobserved trigrams" into the language model. 
>>> i'll give another example if it helps:
>>>
>>> say you have this test.count file:
>>>
>>> 1-gram
>>> this
>>> is
>>> a
>>> test
>>>
>>> 2-gram
>>> this is
>>> is a
>>> a test
>>>
>>> 3-gram
>>> this is a
>>> is a test
>>>
>>> then, say you want to extend this language model with this trigram:
>>>
>>> "this is not"
>>>
>>> which basically has no previous count, and without smoothing in the 
>>> 3-gram model it will have zero probability. but how do we make sure 
>>> that the smoothed language model has a non-zero probability for this 
>>> additional trigram?
>>>
>>> i thought i could do this manually by updating test.count with 
>>> "this is not" with a count of 0, but apparently this is not working...
>> The smoothed 3gram LM will have a non-zero probability for ALL 
>> trigrams, trust me ;-)
>>
>> Try
>>    echo "this is not"  | ngram -lm LM -ppl - -debug 2
>>
>> to see it in action.
>>
>> Andreas
>>
>>>
>>> On 10/25/11 6:38 PM, Andreas Stolcke wrote:
>>>> Burkay Gur wrote:
>>>>> To follow up: basically, when I edit the .count file and add 0 
>>>>> counts for some trigrams, they are not included in the final 
>>>>> .lm file when I read from the .count file and create a 
>>>>> language model.
>>>> A zero count is completely equivalent to a non-existent count, so 
>>>> what you're seeing is expected.
>>>>
>>>> It is not clear what precisely you want to happen.   As a result of 
>>>> discounting and backing off, your LM, even without the unobserved 
>>>> trigram, will already assign a non-zero probability to that 
>>>> trigram.  That's exactly what the ngram smoothing algorithms are for.
>>>>
>>>> If you want to inject some specific statistical information from 
>>>> another dataset into your target LM, you could interpolate (mix) the 
>>>> two LMs to obtain a third LM.  See the description of the ngram 
>>>> -mix-lm option.
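>>>>
>>>> For example, a static 50/50 interpolation of two models could be 
>>>> written out with (file names here are placeholders):
>>>>
>>>>    ngram -lm A.lm -mix-lm B.lm -lambda 0.5 -write-lm mixed.lm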
>>>>
>>>> Andreas
>>>>
>>>>>
>>>>> On 10/25/11 3:41 PM, Burkay Gur wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I have just started using SRILM, and it is a great tool. But I 
>>>>>> ran across this issue. The situation is that I have:
>>>>>>
>>>>>> corpusA.txt
>>>>>> corpusB.txt
>>>>>>
>>>>>> What I want to do is create a 3-gram language model for each 
>>>>>> corpus, but make sure that if a triplet is missing from one 
>>>>>> corpus, a smoothed probability is still assigned to it. For 
>>>>>> example:
>>>>>>
>>>>>> if corpusA has triplet counts:
>>>>>>
>>>>>> this is a    1
>>>>>> is a test    1
>>>>>>
>>>>>> and corpusB has triplet counts:
>>>>>>
>>>>>> that is a    1
>>>>>> is a test    1
>>>>>>
>>>>>> then the final counts for corpusA should be:
>>>>>>
>>>>>> this is a    1
>>>>>> is a test    1
>>>>>> that is a    0
>>>>>>
>>>>>> because "that is a" is in B but not A.
>>>>>>
>>>>>> similarly corpusB should be:
>>>>>>
>>>>>> that is a    1
>>>>>> is a test    1
>>>>>> this is a    0
>>>>>>
>>>>>> After the counts are set up, some smoothing algorithm might be 
>>>>>> used. I have manually tried to set the triplet counts to 0; 
>>>>>> however, it does not seem to work, as they are omitted from the 
>>>>>> 3-grams.
>>>>>>
>>>>>> Can you recommend any other ways of doing this?
>>>>>>
>>>>>> Thank you,
>>>>>> Burkay
>>>>>>
>>>>>
>>>>> _______________________________________________
>>>>> SRILM-User site list
>>>>> SRILM-User at speech.sri.com
>>>>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>>>>
>>>
>>
>



