[SRILM User List] Limiting the vocabulary size of an n-gram model

Fri Feb 24 13:16:55 PST 2012

On 2/24/2012 12:43 PM, L. Amber Wilcox-O'Hearn wrote:
> Greetings.
>
> I am constructing a large trigram model using a pre-specified
> vocabulary size.  What I have done in the past is to first get the
> unigram counts, and then sort the top N most frequent words into my
> vocabulary file, which I then pass to ngram for computing the trigram
> counts, which I then pass again to ngram to construct the LM.
>
> However, I seem to remember having read that the count of counts
> estimates will be better if I compute the trigram counts first, and
> only limit the vocabulary on the final step.  Is that correct?  Are
> there any other shortcuts for this?

This is correct.  The make-big-lm script (a wrapper around ngram-count)
will extract the discounting statistics from the full vocabulary and
then apply them to the LM estimation with a limited vocabulary.  Check
the training-scripts(1) man page.

Andreas
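
To illustrate why the order of operations matters, here is a minimal
Python sketch (not SRILM code) that compares the trigram count-of-counts
obtained from the full vocabulary with those obtained after the
vocabulary has been cut to the top N words.  The helper names, the toy
text, and the choice to map out-of-vocabulary words to "<unk>" are all
assumptions made for the example; they are not a description of what
ngram-count or make-big-lm do internally.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return all n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def top_n_vocab(tokens, n_words):
    """Pick the n_words most frequent words (ties broken alphabetically)."""
    counts = Counter(tokens)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {word for word, _ in ranked[:n_words]}


def map_oov(tokens, vocab, unk="<unk>"):
    """Replace out-of-vocabulary words with a placeholder (an assumption
    of this sketch, chosen only to keep the example small)."""
    return [t if t in vocab else unk for t in tokens]


def count_of_counts(ngram_counts):
    """Map each frequency r to the number of distinct n-grams seen r times."""
    return Counter(ngram_counts.values())


# Hypothetical toy corpus.
tokens = "the cat sat on the mat the dog sat on the log".split()

# Step 1: unigram counts -> top-N vocabulary.
vocab = top_n_vocab(tokens, 3)

# Step 2a: trigram counts over the full vocabulary.
full_coc = count_of_counts(Counter(ngrams(tokens, 3)))

# Step 2b: trigram counts after limiting the vocabulary first.
limited_coc = count_of_counts(Counter(ngrams(map_oov(tokens, vocab), 3)))
```

On this toy text the full vocabulary yields 8 trigrams seen once and 1
seen twice, while the limited vocabulary yields 2 seen once and 4 seen
twice.  Discounting methods that are estimated from these statistics
(Good-Turing, Kneser-Ney) would therefore get quite different parameters
in the two cases, which is the reason for collecting the statistics
before the vocabulary is restricted.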