[SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets

Andreas Stolcke stolcke at icsi.berkeley.edu
Tue Oct 25 15:38:41 PDT 2011


Burkay Gur wrote:
> To follow up, basically, when I edit the .count file and add 0 counts 
> for some trigrams, they will not be included in the final .lm file, 
> when I try to read from the .count file and create a language model.
A zero count is completely equivalent to a non-existent count, so what 
you're seeing is expected.

It is not clear what precisely you want to happen. As a result of 
discounting and backing off, your LM will already assign a non-zero 
probability to that trigram, even though it was never observed. 
That's exactly what the ngram smoothing algorithms are for.
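
To illustrate the point about backing off, here is a minimal sketch of 
a backoff lookup for an unseen trigram. It is not SRILM code: the 
function name backoff_prob, the probability table and the backoff 
weights are all hypothetical, and plain probabilities are used instead 
of the log10 values stored in an ARPA-format LM file.

```python
# Toy backoff lookup (illustrative only, not SRILM's implementation).
# All numbers below are made up for the example.

def backoff_prob(ngram, probs, bows):
    """Return P(last word | preceding words) with simple backoff.

    ngram: tuple of words, e.g. ("that", "is", "a")
    probs: dict mapping n-gram tuples to discounted probabilities
    bows:  dict mapping history tuples to backoff weights
    """
    if ngram in probs:
        # The n-gram was observed: use its discounted probability.
        return probs[ngram]
    if len(ngram) == 1:
        # Unknown unigram; this toy model has no <unk> entry.
        return 0.0
    history = ngram[:-1]
    # Unseen n-gram: scale the lower-order estimate by the backoff
    # weight of the history (1.0 if the history itself is unseen).
    return bows.get(history, 1.0) * backoff_prob(ngram[1:], probs, bows)


# Hypothetical model in which "that is a" was never observed.
probs = {
    ("this", "is", "a"): 0.8,
    ("is", "a"): 0.5,
    ("a",): 0.1,
}
bows = {("that", "is"): 0.4}

p_seen = backoff_prob(("this", "is", "a"), probs, bows)
p_unseen = backoff_prob(("that", "is", "a"), probs, bows)
print(p_seen, p_unseen)
```

The unseen trigram still receives a non-zero probability, because the 
lookup falls back to the bigram "is a" scaled by the backoff weight of 
the history "that is". No zero-count entry is needed for this.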

If you want to inject some specific statistical information from another 
dataset into your target LM, you could interpolate (mix) the two LMs to 
obtain a third LM. See the description of the ngram -mix-lm option.
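
As a rough sketch of what mixing two LMs means, assuming the usual 
linear interpolation of the two models' probabilities: the function 
name mix, the weight lam and the probability values below are 
hypothetical and are not taken from SRILM or from the corpora in this 
thread.

```python
# Linear interpolation of two LM probabilities (illustrative sketch).

def mix(p_a, p_b, lam=0.5):
    """Interpolate two probabilities for the same word and history.

    lam is the weight of the first (main) model; the second model
    gets the remaining weight 1 - lam.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be between 0 and 1")
    return lam * p_a + (1.0 - lam) * p_b


# Hypothetical values: model A only reaches "a" after "that is" by
# backing off, while model B observed the trigram directly.
p_a = 0.02
p_b = 0.30

p_mixed = mix(p_a, p_b, lam=0.5)
print(p_mixed)
```

With equal weights the mixed estimate lies halfway between the two 
models, so trigrams seen only in the other corpus raise the 
probability in the combined LM without any editing of the count file.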

Andreas

>
> On 10/25/11 3:41 PM, Burkay Gur wrote:
>> Hi,
>>
>> I have just started using SRILM, and it is a great tool. But I ran 
>> across this issue. The situation is that I have:
>>
>> corpusA.txt
>> corpusB.txt
>>
>> What I want to do is create two different 3-gram language models for 
>> both corpora. But I want to make sure that if a triplet is 
>> non-existent in the other corpus, then a smoothed probability should 
>> be assigned to it. For example:
>>
>> if corpusA has triplet counts:
>>
>> this is a    1
>> is a test    1
>>
>> and corpusB has triplet counts:
>>
>> that is a    1
>> is a test    1
>>
>> then the final counts for corpusA should be:
>>
>> this is a    1
>> is a test    1
>> that is a    0
>>
>> because "that is a" is in B but not A.
>>
>> similarly corpusB should be:
>>
>> that is a    1
>> is a test    1
>> this is a    0
>>
>> After the counts are set up, some smoothing algorithm might be used. I 
>> have manually tried to set the trigram counts to 0, but it does not 
>> seem to work, as those entries are omitted from the 3-grams.
>>
>> Can you recommend any other ways of doing this?
>>
>> Thank you,
>> Burkay
>>
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
