[SRILM User List] Follow Up: Question about 3-gram Language Model with OOV triplets

Andreas Stolcke stolcke at icsi.berkeley.edu
Tue Oct 25 20:54:41 PDT 2011


In message <50DBF2C0-634E-4391-8379-FD5017CF198E at mit.edu> you wrote:
> But we have not even added "this is not" into the language model yet. If it is not a hard task, can you write a sample to show me how this works?

There is no need to "add" this trigram to the LM.  It can compute a non-zero probability
for it even if it hasn't occurred in the training data.

I suggest you review the basics of N-gram LM smoothing as described in the two textbook chapters
referenced at http://www.speech.sri.com/projects/srilm/ .
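
For a concrete check, here is a minimal sketch (file names are examples; vocab.txt is a word list that includes "not", so that a word absent from the counts still gets a zeroton unigram probability; Witten-Bell discounting is chosen because toy counts this small can trip up the default Good-Turing estimates):

    # build a smoothed 3-gram LM over a fixed vocabulary
    ngram-count -read test.count -order 3 -vocab vocab.txt -wbdiscount -lm test.lm

    # score the unseen trigram; -debug 2 prints each word's probability
    # and the n-gram order it was taken from
    echo "this is not" | ngram -lm test.lm -ppl - -debug 2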

Andreas 

> 
> On Oct 25, 2011, at 10:10 PM, Andreas Stolcke <stolcke at icsi.berkeley.edu> wrote:
> 
> > Burkay Gur wrote:
> >> thank you, i understand that. but the problem is, like you said, how do we introduce these "unobserved trigrams" into the language model. i'll give another example if it helps:
> >> 
> >> say you have this test.count file:
> >> 
> >> 1-gram
> >> this
> >> is
> >> a
> >> test
> >> 
> >> 2-gram
> >> this is
> >> is a
> >> a test
> >> 
> >> 3-gram
> >> this is a
> >> is a test
> >> 
> >> then, say you want to extend this language model with this trigram:
> >> 
> >> "this is not"
> >> 
> >> which basically has no previous count. and without smoothing in the 3-gram model, it will have zero probability. but how do we make sure that the smoothed language model has a non-zero probability for this additional trigram?
> >> 
> >> i thought i could do this manually by updating test.count with "this is not" with count 0. but apparently this is not working..
> > The smoothed 3-gram LM will have a non-zero probability for ALL trigrams, trust me ;-)
> > 
> > Try
> >   echo "this is not"  | ngram -lm LM -ppl - -debug 2
> > 
> > to see it in action.
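> > 
> > (Here LM stands for a smoothed model file built from the counts; one possible invocation, assuming Witten-Bell discounting for these tiny counts:
> >   ngram-count -read test.count -order 3 -wbdiscount -lm LM )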
> > 
> > Andreas
> > 
> >> 
> >> On 10/25/11 6:38 PM, Andreas Stolcke wrote:
> >>> Burkay Gur wrote:
> >>>> To follow up: basically, when I edit the .count file and add 0 counts for some trigrams, they will not be included in the final .lm file when I try to read from the .count file and create a language model.
> >>> A zero count is completely equivalent to a non-existent count, so what you're seeing is expected.
> >>> 
> >>> It is not clear what precisely you want to happen.  As a result of discounting and backing off, your LM, even without the unobserved trigram, will already assign a non-zero probability to that trigram.  That's exactly what the ngram smoothing algorithms are for.
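> >>> 
> >>> Schematically, a backoff trigram model computes (the standard backoff scheme, sketched here for orientation):
> >>> 
> >>>    P(w3 | w1 w2) = f(w3 | w1 w2)             if count(w1 w2 w3) > 0  (discounted estimate)
> >>>                  = bow(w1 w2) * P(w3 | w2)   otherwise  (back off to the bigram)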
> >>> 
> >>> If you want to inject some specific statistical information from another dataset into your target LM, you could interpolate (mix) the two LMs to obtain a third LM.  See the description of the ngram -mix-lm option.
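> >>> 
> >>> For instance (a sketch; the file names follow your example, and the 0.5 weight of the first model is just a placeholder to tune):
> >>> 
> >>>    ngram-count -text corpusA.txt -order 3 -wbdiscount -lm A.lm
> >>>    ngram-count -text corpusB.txt -order 3 -wbdiscount -lm B.lm
> >>>    ngram -lm A.lm -mix-lm B.lm -lambda 0.5 -write-lm mixed.lm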
> >>> 
> >>> Andreas
> >>> 
> >>>> 
> >>>> On 10/25/11 3:41 PM, Burkay Gur wrote:
> >>>>> Hi,
> >>>>> 
> >>>>> I have just started using SRILM, and it is a great tool. But I ran across this issue. The situation is that I have:
> >>>>> 
> >>>>> corpusA.txt
> >>>>> corpusB.txt
> >>>>> 
> >>>>> What I want to do is create two different 3-gram language models, one per corpus. But I want to make sure that if a triplet is non-existent in one corpus but present in the other, a smoothed probability is assigned to it. For example:
> >>>>> 
> >>>>> if corpusA has triplet counts:
> >>>>> 
> >>>>> this is a    1
> >>>>> is a test    1
> >>>>> 
> >>>>> and corpusB has triplet counts:
> >>>>> 
> >>>>> that is a    1
> >>>>> is a test    1
> >>>>> 
> >>>>> then the final counts for corpusA should be:
> >>>>> 
> >>>>> this is a    1
> >>>>> is a test    1
> >>>>> that is a    0
> >>>>> 
> >>>>> because "that is a" is in B but not A.
> >>>>> 
> >>>>> similarly corpusB should be:
> >>>>> 
> >>>>> that is a    1
> >>>>> is a test    1
> >>>>> this is a    0
> >>>>> 
> >>>>> After the counts are set up, some smoothing algorithm might be used. I have manually tried to set the triple-word counts to 0, however it does not seem to work, as they are omitted from the 3-grams.
> >>>>> 
> >>>>> Can you recommend any other ways of doing this?
> >>>>> 
> >>>>> Thank you,
> >>>>> Burkay
> >>>>> 
> >>>> 
> >>>> _______________________________________________
> >>>> SRILM-User site list
> >>>> SRILM-User at speech.sri.com
> >>>> http://www.speech.sri.com/mailman/listinfo/srilm-user
> >>> 
> >> 
> > 
> 


--Andreas

