question about vocabulary
Andreas Stolcke
stolcke at speech.sri.com
Tue May 4 08:57:23 PDT 2004
In message <4097A623.1E8E3B88 at loria.fr> you wrote:
> Hello everybody,
>
> I would like to know if it's possible with the SRILM toolkit to generate
> a vocabulary with the 20000 most frequent words of a corpus for example.
>
> I know that with -write-vocab in the ngram-count function I can
> generate a vocabulary but only with all the words of the corpus.
How about this:
ngram-count -order 1 -text CORPUS -write - | \
sort -k 2,2rn | awk 'NR <= 20000 { print $1 }' > top20000.vocab
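If you need this for more than one vocabulary size, the same idea can be wrapped in a small shell function. This is only a sketch: the name top_vocab is made up for illustration, and it assumes the unigram counts arrive on stdin as one "word count" pair per line, separated by whitespace. It uses the POSIX -k key syntax for sort.

```shell
# top_vocab N
# Hypothetical helper (not part of SRILM): reads "word count" lines on
# stdin and prints the N words with the highest counts, one per line.
top_vocab() {
    # Sort numerically on the second field (the count), largest first,
    # then keep only the word column of the first N lines.
    sort -k 2,2rn | awk -v n="$1" 'NR <= n { print $1 }'
}
```

Usage would then look like:

    ngram-count -order 1 -text CORPUS -write - | top_vocab 20000 > top20000.vocab

Words with equal counts come out in whatever order sort places them, so the cutoff at exactly N is arbitrary among ties.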
--Andreas