[SRILM User List] Question about 3-gram Language Model with OOV triplets

Burkay Gur burkay at mit.edu
Tue Oct 25 12:41:43 PDT 2011


Hi,

I have just started using SRILM, and it is a great tool, but I have run 
into an issue. I have two corpora:

corpusA.txt
corpusB.txt

What I want to do is build a separate 3-gram language model for each 
corpus. But I want to make sure that if a triplet occurs in one corpus 
but not in the other, the model for the corpus that lacks it still 
assigns it a smoothed probability. For example:

if corpusA has triplet counts:

this is a    1
is a test    1

and corpusB has triplet counts:

that is a    1
is a test    1

then the final counts for corpusA should be:

this is a    1
is a test    1
that is a    0

because "that is a" is in B but not A.

similarly corpusB should be:

that is a    1
is a test    1
this is a    0
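
To make the intent concrete, here is a minimal Python sketch of the count 
alignment described above. It is only an illustration and does not use 
SRILM; the function names trigram_counts and align_counts are made up for 
this example, and it assumes whitespace tokenization with no 
sentence-boundary markers.

```python
from collections import Counter

def trigram_counts(sentences):
    """Count word trigrams in an iterable of whitespace-tokenized sentences."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        # Slide a window of three words over the sentence.
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts

def align_counts(counts_a, counts_b):
    """Give both corpora the same trigram set, filling missing entries with 0."""
    all_trigrams = set(counts_a) | set(counts_b)
    aligned_a = {tri: counts_a.get(tri, 0) for tri in all_trigrams}
    aligned_b = {tri: counts_b.get(tri, 0) for tri in all_trigrams}
    return aligned_a, aligned_b

# Toy stand-ins for corpusA.txt and corpusB.txt (one sentence each).
corpus_a = ["this is a test"]
corpus_b = ["that is a test"]

aligned_a, aligned_b = align_counts(trigram_counts(corpus_a),
                                    trigram_counts(corpus_b))

for trigram in sorted(aligned_a):
    print(" ".join(trigram), aligned_a[trigram], aligned_b[trigram])
```

The zero entries only mark which unseen triplets each model should still 
cover; turning them into probabilities is left to the smoothing step.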

After the counts are set up, some smoothing algorithm could be applied. I 
have tried manually setting the triplet counts to 0, but that does not 
seem to work, because they are omitted from the 3-grams.

Can you recommend any other ways of doing this?

Thank you,
Burkay


