[SRILM User List] Limiting the vocabulary size of an n-gram model
L. Amber Wilcox-O'Hearn
amber.wilcox.ohearn at gmail.com
Fri Feb 24 12:43:52 PST 2012
I am constructing a large trigram model with a pre-specified
vocabulary size. What I have done in the past is to get the unigram
counts first and put the N most frequent words into my vocabulary
file. I then pass that file to ngram-count to compute the trigram
counts, and pass those counts to ngram-count again to construct the LM.
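
For illustration, the vocabulary-selection step might look roughly like the
Python sketch below. It is not part of SRILM: the function name top_n_vocab,
the toy corpus, and the alphabetical tie-breaking are all assumptions made
for the example, and it assumes the text is already whitespace-tokenized.

```python
from collections import Counter

def top_n_vocab(lines, n):
    """Return the n most frequent whitespace-separated tokens (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # Sort by descending count, then alphabetically, so ties break deterministically.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

# Toy corpus, purely illustrative.
corpus = ["the cat sat on the mat", "the dog sat on the log"]
vocab = top_n_vocab(corpus, 3)

# One word per line, ready to be written to a vocabulary file.
print("\n".join(vocab))
```

The resulting list, written one word per line, is the kind of file that I
believe can be handed to ngram-count through its -vocab option.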
However, I seem to remember reading that the count-of-counts
estimates will be better if I compute the trigram counts over the
full vocabulary first, and limit the vocabulary only at the final
step. Is that correct? Are
there any other shortcuts for this?
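
To make the quantity in question concrete, here is a small Python sketch of
count-of-counts (how many distinct trigrams occur exactly once, twice, and so
on), computed with and without a vocabulary restriction. It is only a toy
under stated assumptions: out-of-vocabulary words are mapped to <unk> before
counting, no sentence-boundary tags are added, and the helper names are
hypothetical. It does not reproduce SRILM's behavior and does not show which
ordering gives better estimates, only that the statistics differ.

```python
from collections import Counter

def trigram_counts(sentences, vocab=None, unk="<unk>"):
    """Count trigrams; if vocab is given, map other words to unk first (assumption)."""
    counts = Counter()
    for sentence in sentences:
        words = sentence.split()
        if vocab is not None:
            words = [w if w in vocab else unk for w in words]
        for i in range(len(words) - 2):
            counts[tuple(words[i:i + 3])] += 1
    return counts

def count_of_counts(counts):
    """Map each frequency r to the number of distinct trigrams seen r times."""
    return Counter(counts.values())

# Toy corpus, purely illustrative.
corpus = ["the cat sat on the mat", "the dog sat on the log"]

full = count_of_counts(trigram_counts(corpus))
limited = count_of_counts(trigram_counts(corpus, vocab={"the", "on", "sat"}))

print("full vocabulary:   ", dict(full))
print("limited vocabulary:", dict(limited))
```

On this toy data the restricted vocabulary merges the rare trigrams, so no
trigram is left with a count of one, which is the sort of shift in the
count-of-counts that the question is concerned with.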