SRILM and LC_ALL

David Gelbart gelbart at icsi.berkeley.edu
Mon Oct 8 22:24:49 PDT 2007


> My default locale is en_US.  With this locale, I do not see the error David 
> Brodbeck did, even if I use gawk 3.1.5.  If I set LANG=en_US.UTF-8 and use 
> gawk 3.1.5, then I see the error:
>
> $ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
> gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: fatal: 
> Invalid collation character: /[[:lower:]-?]/

A followup:

At home, I'm running gawk 3.1.5 under Fedora Core 3 and my default 
locale is en_US.UTF-8:

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE="en_US.UTF-8"
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

If I use the default locale, I get the "Invalid collation character" 
error.  If I set LANG=C, I get the same error.

If I set LC_ALL=en_US, that error goes away but the make-ngram-pfsg 
test fails with the message "make-ngram-pfsg: stdout output DIFFERS". 
I think this is because LC_ALL, when set, overrides all the other LC_* 
variables (http://opengroup.org/onlinepubs/007908799/xbd/envvar.html), 
so the line in test/tests/make-ngram-pfsg/run-test which sets 
LC_COLLATE=C has no effect.
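
For illustration, the precedence described above can be sketched as a 
small shell function. This is only a sketch: effective_collate is a 
hypothetical helper (not part of SRILM or gawk), and the final fallback 
to "C" is an assumption, since the standard leaves the default 
implementation-defined.

```shell
# Sketch of the locale precedence for the collation category:
# LC_ALL wins if non-empty, then LC_COLLATE, then LANG.
# effective_collate is a hypothetical helper, not part of SRILM or gawk.
# Arguments: $1 = LC_ALL value, $2 = LC_COLLATE value, $3 = LANG value.
effective_collate() {
    if [ -n "$1" ]; then
        printf '%s\n' "$1"
    elif [ -n "$2" ]; then
        printf '%s\n' "$2"
    elif [ -n "$3" ]; then
        printf '%s\n' "$3"
    else
        # Assumption: the implementation default is the "C" locale.
        printf '%s\n' "C"
    fi
}

# LC_ALL set: the LC_COLLATE=C from run-test is ignored.
effective_collate en_US C en_US.UTF-8    # prints en_US
# LC_ALL empty: LC_COLLATE=C takes effect.
effective_collate "" C en_US             # prints C
```

To inspect the current shell, the function could be called as 
effective_collate "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}".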

If I set LANG=en_US and leave LC_ALL unset, then the
"Invalid collation character" error goes away and the make-ngram-pfsg
test passes.
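
One way to try this configuration without changing the login 
environment is a small wrapper along these lines. It is a sketch only: 
run_with_lang is a hypothetical name, and the gawk invocation in the 
comment is simply the command quoted earlier in this thread.

```shell
# Run a command with LC_ALL unset and LANG set to the first argument.
# run_with_lang is a hypothetical helper, not part of SRILM.
run_with_lang() {
    # The subshell keeps the caller's own environment untouched.
    (
        lang=$1
        shift
        unset LC_ALL
        exec env LANG="$lang" "$@"
    )
}

# Example use, with the command from the earlier message:
#   run_with_lang en_US gawk -f "$(which add-pauses-to-pfsg)"
```

Any LC_* variables other than LC_ALL are left as they are, so a script 
that sets LC_COLLATE=C itself still gets the collation it asks for.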

So it appears that the gawk locale tips in the SRILM INSTALL file may 
need to be updated to reflect gawk 3.1.5's behavior.  Please let me 
know if there's anything else I can do to help with this.

Regards,
David







