ARPA format (sorting)
stolcke at speech.sri.com
Tue Mar 11 14:33:44 PST 2003
I'm not aware of any specific sorting requirements. SRILM outputs the
N-grams in an order that optimizes memory caching behavior (essentially
by proximity in the underlying tree data structure), but of course it
can read N-grams in any order.
However, I have heard that some CMU software (like Sphinx) expects the
N-grams to be sorted lexicographically left-to-right. The latest release
contains a script "sort-lm" that reorders the N-grams in a manner that
should be agreeable to the CMU software. It is documented in the lm-scripts(1)
man page.
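
For illustration only, here is a minimal Python sketch of what "sorted
lexicographically left-to-right" could mean for an ARPA file. It is not the
sort-lm script and does not reproduce its behavior; the function name
sort_arpa_lines is hypothetical, and plain code-point string comparison is an
assumption about the ordering the CMU software expects. Within each
"\N-grams:" section it reorders the entries by their N words and leaves all
other lines where they are.

```python
import re

# Matches ARPA section headers such as "\1-grams:" or "\3-grams:".
SECTION = re.compile(r"^\\(\d+)-grams:\s*$")


def sort_arpa_lines(lines):
    """Return ARPA lines with each N-gram section's entries sorted by words.

    Assumption: "lexicographic" means Python's default string order
    (code-point order), compared word by word from left to right.
    """
    out = []
    pending = []  # entries of the current section, not yet emitted
    order = 0     # N of the current section; 0 means outside any section

    def flush():
        # Field 0 is the log probability; fields 1..N are the words.
        # A trailing backoff weight, if present, is ignored by the key.
        pending.sort(key=lambda entry: entry.split()[1:order + 1])
        out.extend(pending)
        pending.clear()

    for line in lines:
        line = line.rstrip("\n")
        match = SECTION.match(line)
        if match or line.startswith("\\end\\"):
            flush()
            order = int(match.group(1)) if match else 0
            out.append(line)
        elif order and line.strip():
            pending.append(line)  # an N-gram entry inside a section
        else:
            flush()               # blank line or header material
            out.append(line)
    flush()
    return out
```

A possible use is to read the model with open(path).readlines(), pass the
list to sort_arpa_lines, and write the result back joined with newlines.
Whether the resulting order satisfies a particular CMU tool would still need
to be checked against that tool.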
In message <20030311232159.A15739 at luistervink.cs.utwente.nl> you wrote:
> Hello Andreas,
> Is there any explicit sorting that LM's in ARPA format should have?
> Specifically, is there a standard sort order for the words of uni-, bi- and
> trigrams?
> (e.g. <unk> first, then diacritics, then alphabetically, then...).
> We've had some problems with arpa's written by SRILM that the CMU toolkit
> can't handle, and we suspect a problem in the sorting of n-grams.
> melis at cs.utwente.nl