[SRILM User List] Limiting the vocabulary size of an n-gram model

L. Amber Wilcox-O'Hearn amber.wilcox.ohearn at gmail.com
Fri Feb 24 12:43:52 PST 2012


I am constructing a large trigram model using a pre-specified
vocabulary size.  What I have done in the past is to first get the
unigram counts, then write the N most frequent words to my vocabulary
file.  I pass that file to ngram-count to compute the trigram counts,
and then pass those counts to ngram-count again to construct the LM.
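
For concreteness, the vocabulary-selection step might look roughly
like the sketch below.  It is only an illustration: the function name
top_n_vocab and the toy corpus are made up for this example, and the
alphabetical tie-breaking is an assumption, not something SRILM does.

```python
from collections import Counter

def top_n_vocab(tokens, n):
    """Return the n most frequent words, most frequent first."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic
    # (the tie-breaking rule is an assumption made for this sketch).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

# Toy corpus, hypothetical data for illustration only.
corpus = "the cat sat on the mat the cat ran".split()
vocab = top_n_vocab(corpus, 2)
print("\n".join(vocab))  # one word per line, like a plain vocabulary file
```

The resulting word list, one word per line, is what I mean by the
vocabulary file.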

However, I seem to remember having read that the count-of-counts
estimates will be better if I compute the trigram counts first, and
only limit the vocabulary in the final step.  Is that correct?  Are
there any other shortcuts for this?
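
To show what I mean by the count of counts changing, here is a toy
sketch.  It assumes that out-of-vocabulary words are mapped to a
single <unk> token before counting; that mapping, the helper names,
and the data are assumptions for illustration and may not match what
SRILM does internally.

```python
from collections import Counter

def trigram_counts(tokens, vocab=None, unk="<unk>"):
    """Count trigrams, optionally mapping out-of-vocabulary words to unk first."""
    if vocab is not None:
        tokens = [w if w in vocab else unk for w in tokens]
    return Counter(zip(tokens, tokens[1:], tokens[2:]))

def count_of_counts(ngram_counts):
    """Map each frequency r to the number of distinct n-grams seen r times."""
    return Counter(ngram_counts.values())

# Toy data, hypothetical and far too small for real estimates.
tokens = "a b c a b d a b e".split()

# Vocabulary not limited before counting: every trigram is a singleton.
full = count_of_counts(trigram_counts(tokens))

# Vocabulary limited before counting: rare words collapse into <unk>,
# so formerly distinct trigrams merge and the singletons disappear.
limited = count_of_counts(trigram_counts(tokens, vocab={"a", "b"}))

print(dict(full))
print(dict(limited))
```

In this toy case the two orderings give quite different counts of
counts, which is the effect I am asking about.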

Thank you,

