[SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets

Tue Oct 25 15:16:49 PDT 2011

To follow up: when I edit the .count file and add 0 counts for some 
trigrams, those trigrams are not included in the final .lm file when I 
read the .count file back in to create a language model.

On 10/25/11 3:41 PM, Burkay Gur wrote:
> Hi,
>
> I have just started using SRILM, and it is a great tool. But I ran 
> into an issue. The situation is that I have:
>
> corpusA.txt
> corpusB.txt
>
> What I want to do is create two separate 3-gram language models, one 
> for each corpus. But I want to make sure that if a triplet appears in 
> one corpus but not in the other, the other corpus's model still 
> assigns it a smoothed probability. For example:
>
> if corpusA has triplet counts:
>
> this is a    1
> is a test    1
>
> and corpusB has triplet counts:
>
> that is a    1
> is a test    1
>
> then the final counts for corpusA should be:
>
> this is a    1
> is a test    1
> that is a    0
>
> because "that is a" is in B but not A.
>
> similarly corpusB should be:
>
> that is a    1
> is a test    1
> this is a    0
>
> After the counts are set up, a smoothing algorithm could be applied. 
> I have tried manually setting the trigram counts to 0, but it does 
> not seem to work: those entries are omitted from the 3-grams.
>
> Can you recommend any other ways of doing this?
>
> Thank you,
> Burkay
>
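
For reference, the zero-filled count tables described in the quoted message can be sketched in plain Python. This is only an illustration of the intended tables, not SRILM code: the helper names trigram_counts and zero_filled are hypothetical, the two one-sentence corpora are assumed so that they reproduce the example counts, and nothing here changes how SRILM treats zero counts when it reads a .count file.

```python
from collections import Counter


def trigram_counts(tokens):
    """Count every run of three consecutive tokens."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


def zero_filled(own, other):
    """Return own's counts plus an explicit 0 for trigrams seen only in other."""
    filled = dict(own)
    for trigram in other:
        filled.setdefault(trigram, 0)
    return filled


# Assumed toy corpora that yield the trigram counts from the message.
corpus_a = "this is a test".split()
corpus_b = "that is a test".split()

counts_a = trigram_counts(corpus_a)
counts_b = trigram_counts(corpus_b)

table_a = zero_filled(counts_a, counts_b)
table_b = zero_filled(counts_b, counts_a)

for name, table in (("corpusA", table_a), ("corpusB", table_b)):
    print(name)
    for trigram, count in table.items():
        # Three words, a tab, then the count.
        print(" ".join(trigram), count, sep="\t")
```

Each table holds the corpus's own trigram counts plus a 0 entry for every trigram that occurs only in the other corpus, which is the layout asked about above.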