[SRILM User List] Limiting the vocabulary size of an n-gram model
stolcke at icsi.berkeley.edu
Fri Feb 24 13:16:55 PST 2012
On 2/24/2012 12:43 PM, L. Amber Wilcox-O'Hearn wrote:
> I am constructing a large trigram model using a pre-specified
> vocabulary size. What I have done in the past is to first get the
> unigram counts, and then sort the top N most frequent words into my
> vocabulary file, which I then pass to ngram for computing the trigram
> counts, which I then pass again to ngram to construct the LM.
> However, I seem to remember having read that the count of counts
> estimates will be better if I compute the trigram counts first, and
> only limit the vocabulary on the final step. Is that correct? Are
> there any other shortcuts for this?
This is correct. The make-big-lm script (a wrapper around ngram-count)
will extract the discounting statistics from the full-vocabulary counts
and then apply them to the LM estimation with a limited vocabulary. See
the training-scripts(1) man page.
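A sketch of that workflow might look like the following. File names, the vocabulary size, and the discounting options (-kndiscount -interpolate) are placeholders; adapt them to your setup:

```shell
# 1. Count trigrams over the FULL vocabulary (no -vocab here), so the
#    counts-of-counts used for discounting reflect the complete data.
ngram-count -order 3 -text corpus.txt -write counts.gz

# 2. Derive the top-N vocabulary from the unigram count lines
#    (lines with exactly two fields: word and count), sorted by
#    count, descending. 50000 is an example size.
gunzip -c counts.gz | awk 'NF == 2' | sort -k2,2nr \
    | head -n 50000 | awk '{print $1}' > vocab.txt

# 3. Estimate the LM with make-big-lm: discounting statistics come
#    from the full counts; the vocabulary is limited only at this step.
make-big-lm -read counts.gz -name biglm -order 3 \
    -kndiscount -interpolate \
    -vocab vocab.txt \
    -lm trigram.lm.gz
```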