[SRILM User List] Limiting the vocabulary size of an n-gram model

Fri Feb 24 12:43:52 PST 2012

Greetings.

I am constructing a large trigram model with a pre-specified
vocabulary size.  What I have done in the past is to first collect the
unigram counts, then write the N most frequent words to my
vocabulary file, which I pass to ngram-count to compute the trigram
counts, which I then pass to ngram-count again to estimate the LM.
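
For concreteness, the vocabulary-selection step looks roughly like the
sketch below.  This is only an illustration: the function name
top_n_vocab and the toy corpus are made up for the example, and the
output is assumed to be a plain word list, one word per line.

```python
from collections import Counter

def top_n_vocab(tokens, n):
    """Return the n most frequent words (hypothetical helper, not part of SRILM)."""
    counts = Counter(tokens)
    # Sort by descending count, then alphabetically so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:n]]

# Toy corpus standing in for the real training text (assumption).
corpus = "the cat sat on the mat the cat ran".split()
vocab = top_n_vocab(corpus, 2)

# One word per line, as assumed for the vocabulary file.
print("\n".join(vocab))
```

Ties at the cutoff are broken alphabetically here, which is an arbitrary
choice; the real cutoff may need a different rule.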

However, I seem to remember reading that the count-of-counts
estimates will be better if I compute the trigram counts first and
limit the vocabulary only in the final step.  Is that correct?  Are
there any other shortcuts for this?

Thank you,
Amber

-- 
http://scholar.google.com/citations?user=15gGywMAAAAJ