ARPA format (sorting)

Tue Mar 11 14:33:44 PST 2003

I'm not aware of any specific sorting requirements.  SRILM outputs the
N-grams in an order that optimizes memory caching behavior (essentially
by proximity in the underlying tree data structure), but of course it
can read N-grams in any order.

However, I have heard that some CMU software (like Sphinx) expects the
N-grams to be sorted lexicographically left-to-right.  The latest release
contains a script "sort-lm" that reorders the N-grams in a manner that
should be agreeable to the CMU software.  It is documented in the lm-scripts(1)
man page.
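
To illustrate what "sorted lexicographically left-to-right" could mean in
practice, here is a minimal Python sketch that reorders the entries inside
each N-gram section of an ARPA file by their word sequence.  It is not the
sort-lm script and makes no claim to match its behavior: the function name
sort_arpa_lines is made up for this example, and comparing words by plain
code point is an assumption, since the CMU tools may expect a different
collation.

```python
import re

def sort_arpa_lines(lines):
    """Sort the entries of each \\N-grams: section by word sequence.

    `lines` is a list of ARPA-file lines without trailing newlines.
    Hypothetical helper for illustration only; not part of SRILM.
    """
    out = []
    section = []   # entries buffered for the current N-gram section
    order = 0      # N of the current section, 0 when outside any section

    def flush():
        # Entry layout: logprob, N words, optional backoff weight.
        # Assumption: words compare by code point, left to right.
        section.sort(key=lambda ln: ln.split()[1:1 + order])
        out.extend(section)
        section.clear()

    for line in lines:
        header = re.fullmatch(r"\\(\d+)-grams:", line.strip())
        if header:
            flush()
            order = int(header.group(1))
            out.append(line)
        elif order and line.strip() and not line.startswith("\\"):
            section.append(line)
        else:
            # Blank lines, the \data\ header, counts, and \end\ pass through.
            flush()
            if line.startswith("\\"):
                order = 0
            out.append(line)
    flush()
    return out
```

Reading a file with open(path).read().splitlines(), passing the result
through sort_arpa_lines, and writing the lines back joined by newlines
would leave the header and counts untouched and change only the order of
entries within each section.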

--Andreas

In message <20030311232159.A15739 at luistervink.cs.utwente.nl> you wrote:
> Hello Andreas,
> 
> Is there any explicit sorting that LM's in ARPA format should have?
> Specifically, is there a standard sort order for the words of uni-, bi- and
> trigrams? (e.g. <unk> first, then diacritics, then alphabetically, then...).
> We've had some problems with arpa's written by SRILM that the CMU toolkit
> can't handle, and we suspect a problem in the sorting of n-grams.
> 
> Regards,
> Paul
> -- 
> melis at cs.utwente.nl