stolcke at speech.sri.com
Wed Dec 18 22:21:20 PST 2002
In message <3E007101.7A413D18 at uni-mb.si> you wrote:
> I have the following problem.
> The n-gram counts are computed from raw text corpus by using
> 'ngram-count' and 'ngram-merge'.
> I experiment with different vocabularies and bigram and trigram models.
> In each experiment I rerun 'ngram-count -vocab -order' and build the
> language model with 'make-big-lm -trust-totals'.
> I tested the language models on my test set and noticed some mistakes. Some
> bigrams that are present in the bigram model get lost in the trigram
> model. When I omit the -trust-totals option, the results are OK.
> Why should I not trust the totals in my case? Are the counts of
> different orders made by 'ngram-count' and 'ngram-merge' not in line?
This is indeed a little strange. However, the -trust-totals option
is obsolete, as it does not interact well with some discounting
methods (e.g., KN). It was always a hack, and the latest version of
make-big-lm uses a different strategy for saving memory on ngrams discarded by
cutoffs (the ngram-count -meta-tag and -read-with-mincounts options,
see the man page).
Still, if you can reduce your problem to a small test case I could look
at it to understand exactly what's going on.
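
One possible way to start narrowing this down to a small test case is to
check whether the counts of different orders agree with each other. The
Python sketch below is only an illustration and not part of SRILM: the
function names are made up, and it assumes a plain-text counts file with one
N-gram per line followed by whitespace and its count. It lists every prefix
whose own count is smaller than the summed counts of the N-grams that extend
it by one word, which is the kind of mismatch that could make trusting the
totals misleading.

```python
from collections import defaultdict


def parse_counts(lines):
    """Parse lines of the assumed form 'w1 w2 ... wN<whitespace>count'."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        # The last whitespace-separated field is taken to be the count.
        ngram, count = line.rsplit(None, 1)
        key = tuple(ngram.split())
        # Sum duplicates, as would happen when merging count files.
        counts[key] = counts.get(key, 0) + int(count)
    return counts


def inconsistent_prefixes(counts):
    """Return (prefix, own_count, extension_total) for suspicious prefixes.

    A prefix is reported when the N-grams extending it by one word sum to
    more than the prefix's own count (a missing prefix counts as 0).
    """
    extension_totals = defaultdict(int)
    for ngram, count in counts.items():
        if len(ngram) > 1:
            extension_totals[ngram[:-1]] += count

    problems = []
    for prefix, total in sorted(extension_totals.items()):
        own = counts.get(prefix, 0)
        if total > own:
            problems.append((prefix, own, total))
    return problems


if __name__ == "__main__":
    # Hypothetical toy data: the trigrams starting with "a b" sum to 4,
    # but the bigram "a b" itself is listed with a count of only 3.
    sample = ["a\t5", "a b\t3", "a b c\t2", "a b d\t2"]
    for prefix, own, total in inconsistent_prefixes(parse_counts(sample)):
        print(" ".join(prefix), own, total)
```

If the file contains only the highest-order N-grams, every prefix will be
reported, so the check is meaningful only when all orders are present in the
same file.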