[SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets
Burkay Gur
burkay at mit.edu
Tue Oct 25 17:29:35 PDT 2011
Thank you, I understand that. But the problem is, like you said, how do
we introduce these "unobserved trigrams" into the language model? I'll
give another example if it helps.
Say you have this test.count file:
1-gram
this
is
a
test
2-gram
this is
is a
a test
3-gram
this is a
is a test
Then, say you want to extend this language model with this trigram:
"this is not"
which has no previous count. Without smoothing, the 3-gram model will
give it zero probability. But how do we make sure that the smoothed
language model has a non-zero probability for this additional trigram?
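A minimal Python sketch of how smoothing can give an unseen trigram a
non-zero probability. This is not SRILM's estimator: it swaps in plain
interpolated absolute discounting, and it assumes that "not" is already in
the vocabulary, that the discount is 0.5, and that the names `train` and
`prob` are purely illustrative.

```python
from collections import Counter, defaultdict

def train(tokens, order=3):
    """Collect follower counts for every context of length 0..order-1."""
    followers = defaultdict(Counter)
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            followers[gram[:-1]][gram[-1]] += 1
    return followers

def prob(word, context, followers, vocab, discount=0.5):
    """Interpolated absolute discounting (assumed scheme, not SRILM's)."""
    context = tuple(context)
    # Lower-order estimate: shorter context, or uniform below the unigram level.
    if context:
        lower = prob(word, context[1:], followers, vocab, discount)
    else:
        lower = 1.0 / len(vocab)
    seen = followers.get(context)
    if not seen:
        # Context never observed: rely entirely on the lower-order estimate.
        return lower
    total = sum(seen.values())
    discounted = max(seen[word] - discount, 0.0) / total
    # Mass removed from seen words is redistributed via the lower order.
    backoff_mass = discount * len(seen) / total
    return discounted + backoff_mass * lower

# Vocabulary is an assumption: "not" must be known to the model.
vocab = ["this", "is", "a", "test", "not"]
followers = train("this is a test".split())

p_unseen = prob("not", ("this", "is"), followers, vocab)
p_total = sum(prob(w, ("this", "is"), followers, vocab) for w in vocab)
print(p_unseen > 0.0, round(p_total, 6))
```

The trigram "this is not" never appears in the counts, yet it receives
probability mass from the shorter contexts, and the probabilities over the
vocabulary still sum to one.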
I thought I could do this manually by adding "this is not" to test.count
with a count of 0, but apparently this does not work.
On 10/25/11 6:38 PM, Andreas Stolcke wrote:
> Burkay Gur wrote:
>> To follow up, basically, when I edit the .count file and add 0 counts
>> for some trigrams, they will not be included in the final .lm file,
>> when I try to read from the .count file and create a language model.
> A zero count is completely equivalent to a non-existent count, so what
> you're seeing is expected.
>
> It is not clear what precisely you want to happen. As a result of
> discounting and backing off, your LM, even without the unobserved
> trigram, will already assign a non-zero probability to that trigram.
> That's exactly what the ngram smoothing algorithms are for.
>
> If you want to inject some specific statistical information from
> another dataset into your target LM, you could interpolate (mix) the
> two LMs to obtain a third LM. See the description of the ngram
> -mix-lm option.
>
> Andreas
>
>>
>> On 10/25/11 3:41 PM, Burkay Gur wrote:
>>> Hi,
>>>
>>> I have just started using SRILM, and it is a great tool. But I ran
>>> across this issue. The situation is that I have:
>>>
>>> corpusA.txt
>>> corpusB.txt
>>>
>>> What I want to do is create two different 3-gram language models, one
>>> for each corpus. But I want to make sure that if a triplet appears in
>>> one corpus but not in the other, the other model still assigns it a
>>> smoothed probability. For example:
>>>
>>> if corpusA has triplet counts:
>>>
>>> this is a 1
>>> is a test 1
>>>
>>> and corpusB has triplet counts:
>>>
>>> that is a 1
>>> is a test 1
>>>
>>> then the final counts for corpusA should be:
>>>
>>> this is a 1
>>> is a test 1
>>> that is a 0
>>>
>>> because "that is a" is in B but not A.
>>>
>>> similarly corpusB should be:
>>>
>>> that is a 1
>>> is a test 1
>>> this is a 0
>>>
>>> After the counts are set up, some smoothing algorithm might be used.
>>> I have manually tried to set the trigram counts to 0, but it does not
>>> seem to work: they are omitted from the 3-grams.
>>>
>>> Can you recommend any other ways of doing this?
>>>
>>> Thank you,
>>> Burkay
>>>
>>
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
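The interpolation suggested in the quoted reply amounts to a weighted
average of the two models' probabilities. A minimal Python sketch of that
idea, not SRILM's implementation: the function name `interpolate`, the
weight `lam`, and the toy distributions are illustrative assumptions.

```python
def interpolate(p_a, p_b, lam=0.5):
    """Return lam * p_a + (1 - lam) * p_b over the union of both word sets."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must be in [0, 1]")
    words = set(p_a) | set(p_b)
    return {w: lam * p_a.get(w, 0.0) + (1.0 - lam) * p_b.get(w, 0.0)
            for w in words}

# Made-up next-word distributions for the same context under two models,
# standing in for LMs trained on corpusA and corpusB.
p_a = {"this": 0.9, "that": 0.1}
p_b = {"this": 0.2, "that": 0.8}

mixed = interpolate(p_a, p_b, lam=0.5)
print(mixed)
```

Because each input sums to one, the mixture does too, and any word that
either model supports keeps a non-zero probability whenever `lam` is
strictly between 0 and 1.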