ARPA format (sorting)

Paul Melis melis at
Tue Mar 11 14:21:59 PST 2003

Is there any explicit sort order that LMs in ARPA format are required to have? Specifically, is there a standard sort order for the words of uni-, bi- and trigrams (e.g. <unk> first, then diacritics, then alphabetically, then...)?
We've had some problems with ARPA files written by SRILM that the CMU toolkit can't handle, and we suspect the cause is the order in which the n-grams are sorted.
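
To narrow this down, a minimal sketch of a check along these lines might help. It reads the \N-grams: sections of an ARPA file and reports where an n-gram sorts before its predecessor. The function names and the SAMPLE text are made up for illustration, and comparing by plain Unicode code point order is only an assumption: it is one candidate ordering, not necessarily the one either toolkit expects.

```python
import re

def read_arpa_ngrams(lines):
    """Return {order: [word tuples]} in file order for each \\N-grams: section."""
    sections = {}
    order = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        match = re.fullmatch(r"\\(\d+)-grams:", line)
        if match:
            order = int(match.group(1))
            sections[order] = []
            continue
        if line.startswith("\\"):  # \data\ or \end\
            order = None
            continue
        if order is None:
            continue  # header lines such as "ngram 1=3"
        fields = line.split()
        # fields[0] is the log probability, then `order` words,
        # then an optional backoff weight, which is ignored here.
        sections[order].append(tuple(fields[1:1 + order]))
    return sections

def unsorted_positions(ngrams):
    """Indices whose n-gram compares lower than the one before it.

    Assumption: tuples of words compared by code point order.
    """
    return [i for i in range(1, len(ngrams)) if ngrams[i] < ngrams[i - 1]]

# Hypothetical toy model; the bigram section is deliberately out of order.
SAMPLE = """\\data\\
ngram 1=3
ngram 2=2

\\1-grams:
-1.0 <unk> -0.5
-0.7 a -0.3
-0.9 b

\\2-grams:
-0.4 b a
-0.6 a b

\\end\\
"""

if __name__ == "__main__":
    for n, ngrams in sorted(read_arpa_ngrams(SAMPLE.splitlines()).items()):
        print(n, unsorted_positions(ngrams))
```

Running the same check on a file written by SRILM and on one that the CMU toolkit accepts would show whether the two differ in ordering at all, before looking for other causes.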

