SRILM and LC_ALL
David Gelbart
gelbart at icsi.berkeley.edu
Mon Oct 8 22:24:49 PDT 2007
> My default locale is en_US. With this locale, I do not see the error David
> Brodbeck did, even if I use gawk 3.1.5. If I set LANG=en_US.UTF-8 and use
> gawk 3.1.5, then I see the error:
>
> $ /usr/local/bin/gawk -f `which add-pauses-to-pfsg`
> gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: fatal:
> Invalid collation character: /[[:lower:]-?]/
A followup:
At home, I'm running gawk 3.1.5 under Fedora Core 3 and my default
locale is en_US.UTF-8:
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=
If I use the default locale, I get the "Invalid collation character"
error. If I set LANG=C, I get the same error.
If I set LC_ALL=en_US, that error goes away but the make-ngram-pfsg
test fails with the message "make-ngram-pfsg: stdout output DIFFERS".
I think this is because when LC_ALL is set it overrides the other LC_*
variables (http://opengroup.org/onlinepubs/007908799/xbd/envvar.html).
This means that the line in test/tests/make-ngram-pfsg/run-test which
sets LC_COLLATE=C has no effect when LC_ALL is set.
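
To make the precedence concrete, here is a minimal sketch of the lookup order described on the Open Group page linked above, applied to a single category: LC_ALL first, then the category variable, then LANG. The function name effective_locale is hypothetical and not part of SRILM or gawk, and the final fallback to "C" is a simplifying assumption (the real default when nothing is set is implementation-defined). It only models the rule and does not change any locale settings.

```shell
# Hypothetical helper, not part of SRILM: prints the locale that the
# precedence rules select for one category.
# Arguments: value of LC_ALL, value of the category variable, value of LANG.
effective_locale() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"   # LC_ALL wins whenever it is non-empty
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"   # otherwise the category variable applies
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"   # otherwise LANG supplies the default
    else
        printf '%s\n' "C"    # assumption: fall back to the POSIX locale
    fi
}

# The two situations from this message, with LC_COLLATE=C as set by run-test:
with_lc_all=$(effective_locale "en_US" "C" "en_US.UTF-8")
without_lc_all=$(effective_locale "" "C" "en_US")
echo "LC_ALL=en_US: collation locale is $with_lc_all"
echo "LC_ALL unset: collation locale is $without_lc_all"
```

In the first case the LC_COLLATE=C setting from run-test is ignored, which would be consistent with the "stdout output DIFFERS" failure; in the second case it takes effect. To inspect the current shell, the helper could be called as effective_locale "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}".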
If I set LANG=en_US and leave LC_ALL unset, then the
"Invalid collation character" error goes away and the make-ngram-pfsg
test passes.
So it appears that the gawk locale tips in the SRILM INSTALL file may
need to be updated to reflect gawk 3.1.5's behavior. Please let me
know if there's anything else I can do to help with this.
Regards,
David