question about vocabulary 
    Andreas Stolcke 
    stolcke at speech.sri.com
       
    Tue May  4 08:57:23 PDT 2004
    
    
  
In message <4097A623.1E8E3B88 at loria.fr> you wrote:
> Hello everybody,
> 
> I would like to know if it's possible with the SRILM toolkit to generate
> a vocabulary with the 20000 most frequent words of a corpus for example. 
> 
> I know that with -write-vocab in the ngram-count command I can
> generate a vocabulary, but only one containing all the words of the corpus.
How about this:
ngram-count -order 1 -text CORPUS -write - | \
sort -k 2,2rn | awk 'NR <= 20000 { print $1 }' > top20000.vocab
--Andreas 
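
A slightly more defensive variant of the pipeline above, as a sketch only. It assumes the unigram counts arrive as "word count" lines, which is the format written by ngram-count -write -, and that the sentence-boundary tags <s> and </s> may appear among the counts and are not wanted in the word list. The function name top_vocab is hypothetical, not part of SRILM.

```shell
#!/bin/sh
# top_vocab N: read "word count" lines on stdin and print the N most
# frequent words, one per line. Hypothetical helper, not an SRILM tool.
top_vocab() {
    n="$1"
    # Drop the sentence-boundary tags if they are present in the counts.
    awk '$1 != "<s>" && $1 != "</s>"' |
    # Sort by count, descending; break ties by word so output is stable.
    LC_ALL=C sort -k 2,2nr -k 1,1 |
    # Keep the first N lines and print only the word column.
    head -n "$n" |
    awk '{ print $1 }'
}

# Example use (requires SRILM; CORPUS is a placeholder file name):
#   ngram-count -order 1 -text CORPUS -write - | top_vocab 20000 > top20000.vocab
```

The resulting file should be usable wherever a vocabulary file is expected, for example with the -vocab option of ngram-count.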
    
    