[SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Oct 25 19:10:40 PDT 2011
Burkay Gur wrote:
> Thank you, I understand that. But the problem is, like you said, how
> do we introduce these "unobserved trigrams" into the language model.
> I'll give another example if it helps:
>
> Say you have this test.count file:
>
> 1-gram
> this
> is
> a
> test
>
> 2-gram
> this is
> is a
> a test
>
> 3-gram
> this is a
> is a test
>
> Then, say you want to extend this language model with this trigram:
>
> "this is not"
>
> which basically has no previous count, and without smoothing the
> 3-gram model will assign it zero probability. But how do we make sure
> that the smoothed language model has a non-zero probability for this
> additional trigram?
>
> I thought I could do this manually by adding "this is not" to
> test.count with a count of 0, but apparently this is not working.
The smoothed 3gram LM will have a non-zero probability for ALL
trigrams, trust me ;-)
Try
echo "this is not" | ngram -lm LM -ppl - -debug 2
to see it in action.
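For example, assuming the LM was built from a text file called
corpusA.txt (the file name and the Witten-Bell discounting option below
are only illustrative), the full sequence would be something like

  ngram-count -text corpusA.txt -order 3 -wbdiscount -lm LM
  echo "this is not" | ngram -lm LM -ppl - -debug 2

The -debug 2 output lists, word by word, which ngram order was actually
used and the (backed-off) probability each word received.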
Andreas
>
> On 10/25/11 6:38 PM, Andreas Stolcke wrote:
>> Burkay Gur wrote:
>>> To follow up, basically, when I edit the .count file and add 0
>>> counts for some trigrams, they will not be included in the final .lm
>>> file, when I try to read from the .count file and create a language
>>> model.
>> A zero count is completely equivalent to a non-existent count, so
>> what you're seeing is expected.
>>
>> It is not clear what precisely you want to happen. As a result of
>> discounting and backing off, your LM, even without the unobserved
>> trigram, will already assign a non-zero probability to that trigram.
>> That's exactly what the ngram smoothing algorithms are for.
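>>
>> Concretely, in a standard ARPA-style backoff model (just to
>> illustrate the mechanism), an unobserved trigram is scored as
>>
>>   P(not | this is) = bow(this is) * P(not | is)
>>
>> where bow(this is) is the backoff weight stored with the bigram
>> "this is", so the result is non-zero as long as "not" is in the
>> vocabulary.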
>>
>> If you want to inject some specific statistical information from
>> another dataset into your target LM, you could interpolate (mix) the
>> two LMs to obtain a third LM. See the description of the ngram
>> -mix-lm option.
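>>
>> For instance (file names and the 0.5 weight below are just
>> placeholders), something along these lines
>>
>>   ngram -order 3 -lm A.lm -mix-lm B.lm -lambda 0.5 -write-lm AB.lm
>>
>> would write out a single LM that (approximately) interpolates the
>> two, so trigrams seen only in the second corpus still contribute to
>> the resulting model.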
>>
>> Andreas
>>
>>>
>>> On 10/25/11 3:41 PM, Burkay Gur wrote:
>>>> Hi,
>>>>
>>>> I have just started using SRILM, and it is a great tool. But I ran
>>>> across this issue. The situation is that I have:
>>>>
>>>> corpusA.txt
>>>> corpusB.txt
>>>>
>>>> What I want to do is create two different 3-gram language models,
>>>> one for each corpus. But I want to make sure that if a triplet is
>>>> missing from one corpus but present in the other, a smoothed
>>>> probability is still assigned to it. For example:
>>>>
>>>> if corpusA has triplet counts:
>>>>
>>>> this is a 1
>>>> is a test 1
>>>>
>>>> and corpusB has triplet counts:
>>>>
>>>> that is a 1
>>>> is a test 1
>>>>
>>>> then the final counts for corpusA should be:
>>>>
>>>> this is a 1
>>>> is a test 1
>>>> that is a 0
>>>>
>>>> because "that is a" is in B but not A.
>>>>
>>>> Similarly, the final counts for corpusB should be:
>>>>
>>>> that is a 1
>>>> is a test 1
>>>> this is a 0
>>>>
>>>> After the counts are set up, some smoothing algorithm might be
>>>> used. I have manually tried to set the triple-word counts to 0;
>>>> however, it does not seem to work, as they are omitted from the
>>>> 3-grams.
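>>>>
>>>> (For reference, the workflow I am using is roughly the following;
>>>> the exact options here are only what I have been experimenting
>>>> with:
>>>>
>>>>   ngram-count -text corpusA.txt -order 3 -write corpusA.count
>>>>   ngram-count -read corpusA.count -order 3 -lm corpusA.lm
>>>>
>>>> with the 0-count lines added to corpusA.count by hand before the
>>>> second step.)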
>>>>
>>>> Can you recommend any other ways of doing this?
>>>>
>>>> Thank you,
>>>> Burkay
>>>>
>>>
>>
>