missing counts

Fri Dec 20 00:54:21 PST 2002

--Andreas

In message <3E02D0F1.FF6EBC23 at uni-mb.si>you wrote:
> 
> Andreas Stolcke wrote:
> 
> > In message <3E007101.7A413D18 at uni-mb.si>you wrote:
> > >
> > > Hi,
> > >
> > > I have the following problem.
> > >
> > > The n-gram counts are computed from a raw text corpus using
> > > 'ngram-count' and 'ngram-merge'.
> > > I experiment with different vocabularies and bigram and trigram models.
> > > In each experiment I rerun 'ngram-count -vocab -order' and build the
> > > language model with 'make-big-lm -trust-totals'.
> > > I tested the language models on my test set and noticed some mistakes. Some
> > > bigrams that are present in the bigram model get lost in the trigram
> > > model. When I omit the -trust-totals option, the results are OK.
> > > Why should I not trust the totals in my case?  Are the counts of
> > > different orders made by 'ngram-count' and 'ngram-merge' not in line?
> > >
> > > Regards,
> > >
> > > Mirjam.
> >
> > This is indeed a little strange. However, the -trust-totals option
> > is obsolete, as it does not interact well with some discounting
> > methods (e.g., KN).  It was always a hack, and the latest version of
> > make-big-lm uses a different strategy for saving memory on ngrams discarded
> > by cutoffs (the ngram-count -meta-tag and -read-with-mincounts options,
> > see the man page).
> >
> > Still, if you can reduce your problem to a small test case I could look
> > at it to understand exactly what's going on.
> >
> > --Andreas
> 
> Thank you for answering so quickly.
> You are right. I used KN discounting.  I see, it's time to switch from
> version 1.3.1 to 1.3.2.
> I will report the results.

And of course KN discounting modifies the lower-order counts, so at a given
cutoff > 1 you might lose ngrams because, after the KN method is applied,
the counts fall below the cutoff.  This is consistent with your observation
that a bigram is not in the trigram model while it is in the bigram model.
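
To make the effect concrete, here is a minimal sketch on a toy corpus. It
assumes the textbook form of KN smoothing, where a lower-order count is
replaced by the number of distinct words that precede the ngram. The corpus,
the function names, and the cutoff value are made up for illustration, and
sentence-boundary handling is left out; this is not the SRILM code itself.

```python
from collections import Counter

def ngram_counts(sentences, order):
    """Raw counts of all ngrams of the given order."""
    counts = Counter()
    for words in sentences:
        for i in range(len(words) - order + 1):
            counts[tuple(words[i:i + order])] += 1
    return counts

def kn_lower_order_counts(higher_counts):
    """KN-modified lower-order counts: for each ngram, the number of
    distinct words seen immediately before it (assumed textbook KN)."""
    lower = Counter()
    for ngram in higher_counts:      # each distinct higher-order ngram counts once
        lower[ngram[1:]] += 1
    return lower

# Hypothetical toy corpus: "new york" is frequent but always follows "in".
corpus = [["in", "new", "york"]] * 5 + [["a", "big", "city"],
                                        ["the", "big", "city"]]

raw_bigrams = ngram_counts(corpus, 2)        # used when the bigram is the highest order
trigrams = ngram_counts(corpus, 3)
kn_bigrams = kn_lower_order_counts(trigrams) # used for bigrams inside a trigram model

cutoff = 2  # illustrative: keep an ngram only if its count is at least 2
kept_in_bigram_model = {g for g, c in raw_bigrams.items() if c >= cutoff}
kept_in_trigram_model = {g for g, c in kn_bigrams.items() if c >= cutoff}

for bigram in [("new", "york"), ("big", "city")]:
    print(bigram, "raw:", raw_bigrams[bigram], "KN:", kn_bigrams[bigram],
          "in bigram LM:", bigram in kept_in_bigram_model,
          "in trigram LM:", bigram in kept_in_trigram_model)
```

In this toy case "new york" has a raw count of 5 but only one distinct left
context, so it survives the cutoff in the bigram model and is dropped from
the trigram model, while "big city" (two distinct left contexts) is kept in
both.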

--Andreas