[SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets

Burkay Gur burkay at mit.edu
Tue Oct 25 19:53:06 PDT 2011


But we have not even added "this is not" to the language model yet. If it is not a hard task, can you write a sample to show me how this works?
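Not an SRILM code sample, but here is a toy sketch of the backoff computation a smoothed model performs for an unseen trigram such as "this is not". All log10 probabilities and backoff weights below are invented for illustration; in a real ARPA-format LM they come out of the discounting algorithm:

```python
# Toy ARPA-style backoff model: all log10 probabilities and backoff
# weights below are MADE UP for illustration, not real SRILM output.
# Each entry maps an n-gram to (log10 probability, log10 backoff weight).
unigrams = {"this": (-0.7, -0.3), "is": (-0.7, -0.3), "a": (-0.7, -0.3),
            "test": (-0.9, 0.0), "not": (-1.0, 0.0)}
bigrams = {("this", "is"): (-0.2, -0.4), ("is", "a"): (-0.2, -0.4),
           ("a", "test"): (-0.2, 0.0)}
trigrams = {("this", "is", "a"): -0.1, ("is", "a", "test"): -0.1}

def trigram_logprob(w1, w2, w3):
    """log10 P(w3 | w1 w2) with Katz-style backoff."""
    if (w1, w2, w3) in trigrams:          # seen trigram: use its estimate
        return trigrams[(w1, w2, w3)]
    bow2 = bigrams.get((w1, w2), (0.0, 0.0))[1]   # backoff weight of "w1 w2"
    if (w2, w3) in bigrams:               # back off to the bigram "w2 w3"
        return bow2 + bigrams[(w2, w3)][0]
    bow1 = unigrams.get(w2, (0.0, 0.0))[1]        # backoff weight of "w2"
    return bow2 + bow1 + unigrams[w3][0]  # fall through to the unigram

# "this is not" was never counted, yet it gets a non-zero probability:
lp = trigram_logprob("this", "is", "not")
print(lp, 10 ** lp)
```

The point is that the lookup never fails: when the trigram is absent, the model charges the backoff weight of the "this is" history and falls through to lower-order estimates, so every trigram over the vocabulary gets a non-zero probability without any explicit zero-count entries.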

On Oct 25, 2011, at 10:10 PM, Andreas Stolcke <stolcke at icsi.berkeley.edu> wrote:

> Burkay Gur wrote:
>> Thank you, I understand that. But the problem is, like you said, how do we introduce these "unobserved trigrams" into the language model. I'll give another example if it helps:
>> 
>> say you have this test.count file:
>> 
>> 1-gram
>> this
>> is
>> a
>> test
>> 
>> 2-gram
>> this is
>> is a
>> a test
>> 
>> 3-gram
>> this is a
>> is a test
>> 
>> then, say you want to extend this language model with this trigram:
>> 
>> "this is not"
>> 
>> which basically has no previous count. And without smoothing, in the 3-gram model it will have zero probability. But how do we make sure that the smoothed language model has a non-zero probability for this additional trigram?
>> 
>> I thought I could do this manually by updating test.count with "this is not" with a count of 0, but apparently this is not working.
> The smoothed 3-gram LM will have a non-zero probability for ALL trigrams, trust me ;-)
> 
> Try
>   echo "this is not"  | ngram -lm LM -ppl - -debug 2
> 
> to see it in action.
> 
> Andreas
> 
>> 
>> On 10/25/11 6:38 PM, Andreas Stolcke wrote:
>>> Burkay Gur wrote:
>>>> To follow up: basically, when I edit the .count file and add 0 counts for some trigrams, they are not included in the final .lm file when I read the .count file and create a language model.
>>> A zero count is completely equivalent to a non-existent count, so what you're seeing is expected.
>>> 
>>> It is not clear what precisely you want to happen. As a result of discounting and backing off, your LM, even without the unobserved trigram, will already assign a non-zero probability to that trigram. That's exactly what the ngram smoothing algorithms are for.
>>> 
>>> If you want to inject some specific statistical information from another dataset into your target LM, you could interpolate (mix) the two LMs to obtain a third LM. See the description of the ngram -mix-lm option.
>>> 
>>> Andreas
>>> 
>>>> 
>>>> On 10/25/11 3:41 PM, Burkay Gur wrote:
>>>>> Hi,
>>>>> 
>>>>> I have just started using SRILM, and it is a great tool. But I ran across this issue. The situation is that I have:
>>>>> 
>>>>> corpusA.txt
>>>>> corpusB.txt
>>>>> 
>>>>> What I want to do is create two different 3-gram language models, one for each corpus. But I want to make sure that if a triplet is non-existent in the other corpus, a smoothed probability is still assigned to it. For example:
>>>>> 
>>>>> if corpusA has triplet counts:
>>>>> 
>>>>> this is a    1
>>>>> is a test    1
>>>>> 
>>>>> and corpusB has triplet counts:
>>>>> 
>>>>> that is a    1
>>>>> is a test    1
>>>>> 
>>>>> then the final counts for corpusA should be:
>>>>> 
>>>>> this is a    1
>>>>> is a test    1
>>>>> that is a    0
>>>>> 
>>>>> because "that is a" is in B but not A.
>>>>> 
>>>>> similarly corpusB should be:
>>>>> 
>>>>> that is a    1
>>>>> is a test    1
>>>>> this is a    0
>>>>> 
>>>>> After the counts are set up, some smoothing algorithm might be used. I have manually tried to set the triplet counts to 0; however, it does not seem to work, as they are omitted from the 3-grams.
>>>>> 
>>>>> Can you recommend any other ways of doing this?
>>>>> 
>>>>> Thank you,
>>>>> Burkay
>>>>> 
>>>> 
>>>> _______________________________________________
>>>> SRILM-User site list
>>>> SRILM-User at speech.sri.com
>>>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>>> 
>> 
> 

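To round out the thread: the ngram -mix-lm route Andreas mentions linearly interpolates two models. A minimal sketch of the arithmetic (the probabilities and the mixture weight below are invented; this is not SRILM's implementation):

```python
# Toy sketch of what LM interpolation (cf. ngram -mix-lm -lambda) computes:
#   P_mix(w | h) = lambda * P_A(w | h) + (1 - lambda) * P_B(w | h)
# The probability values below are invented for illustration.

def interpolate(p_a, p_b, lam=0.5):
    """Linear interpolation of two LM probabilities for the same event."""
    return lam * p_a + (1.0 - lam) * p_b

# Suppose model A never saw "that is a" (only a small backed-off
# probability), while model B observed it:
p_a = 0.002   # P_A(a | that is), via backoff in LM A (made up)
p_b = 0.25    # P_B(a | that is), observed in corpus B (made up)

p_mix = interpolate(p_a, p_b, lam=0.5)
print(p_mix)  # the mixed LM gives the trigram substantial probability
```

This is the effect the original question was after: n-grams present in either corpus end up with meaningful probability in the combined model, without hand-editing count files.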

