> I would like to know if it's possible with the SRILM toolkit to generate
> a vocabulary with the 20000 most frequent words of a corpus for example.

You should be able to achieve this by using "ngram-count -order 1 -write -", doing a reverse numeric sort on field 2 (the count), and taking the top 20000.
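As a rough sketch of that pipeline, something like the following should work. The helper name top_vocab and the file names corpus.txt and vocab.txt are made up for illustration, and it assumes the counts arrive as tab-separated "word<TAB>count" lines, one unigram per line.

```shell
# top_vocab N: read "word<TAB>count" lines on stdin and print the N words
# with the highest counts, one per line. (Hypothetical helper name.)
top_vocab() {
  # Sort numerically on field 2 (the count) in descending order,
  # keep the first N lines, then drop the count column.
  sort -t "$(printf '\t')" -k2,2nr | head -n "$1" | cut -f1
}

# Intended use, assuming SRILM is on your PATH and corpus.txt is your
# training text (both file names are placeholders):
#   ngram-count -order 1 -text corpus.txt -write - | top_vocab 20000 > vocab.txt
```

Sentence-boundary tokens such as <s> and </s> may show up among the unigram counts, so you might want to filter them out or take a few extra lines to compensate. Words with equal counts at the cutoff are ordered arbitrarily by this sketch.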