SRILM and LC_ALL

Fri Oct 19 11:00:36 PDT 2007

David et al.,

there were several issues with add-pauses-to-pfsg and UTF-8 locales.
The regular expression /[\x80-\x8F]/ is not legal in UTF-8 locales because
it contains characters with the high bit set (UTF-8 uses the high bit to
encode multibyte characters).
I fixed this recently by using a different but equivalent regex instead.
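
To illustrate the high-bit point, here is a minimal sketch that dumps the
bytes of a string in hex. The helper name hex_bytes is hypothetical and is
not part of SRILM; it only assumes a POSIX shell with od and tr.

```shell
# Print the bytes of a string as hex digits (hypothetical helper, not SRILM code).
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
    printf '\n'
}

# U+00E9 (e with acute accent) written out as its two UTF-8 bytes.
e_acute=$(printf '\303\251')

hex_bytes "a"          # prints 61: one byte, high bit clear
hex_bytes "$e_acute"   # prints c3a9: two bytes, both with the high bit set
```

A lone byte in the range 0x80-0xBF is only a continuation byte in UTF-8, so
it cannot stand for a character by itself, which is presumably why a range
built from such bytes is rejected in a UTF-8 locale.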

The other problem is that pre-3.1.5 (actually pre-3.1.4) gawk
was not using ctype library functions for implementing character classes 
like [:lower:].

So, the upshot is that if you 

1) get the latest beta version (to fix the regex issue) AND
2) use gawk 3.1.5 or later

you should be able to use add-pauses-to-pfsg and pass the "make-ngram-pfsg"
test regardless of locale setting.  You CAN use gawk 3.1.3 (which is
what seems to be pre-installed on many Linux systems) but then you need
to use LANG=C or LANG=en_US.
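
The rule above can be sketched as a small shell check. The function names
version_at_least and locale_advice are hypothetical, the 3.1.5 threshold is
taken from the list above, and the comparison assumes purely numeric version
fields (a suffix such as "3.1.6a" would need extra handling).

```shell
# Succeed if dotted version $1 is >= dotted version $2 (numeric fields only).
version_at_least() {
    have=$1
    want=$2
    while [ -n "$have" ] || [ -n "$want" ]; do
        h=${have%%.*}; w=${want%%.*}
        h=${h:-0}; w=${w:-0}            # treat missing fields as 0
        [ "$h" -gt "$w" ] && return 0
        [ "$h" -lt "$w" ] && return 1
        # Drop the leading field, or empty the string when none remain.
        case $have in *.*) have=${have#*.} ;; *) have= ;; esac
        case $want in *.*) want=${want#*.} ;; *) want= ;; esac
    done
    return 0
}

# Suggest a locale setting for add-pauses-to-pfsg given a gawk version string.
locale_advice() {
    if version_at_least "$1" 3.1.5; then
        echo "any locale"
    else
        echo "LANG=C"
    fi
}

locale_advice 3.1.3   # prints: LANG=C
locale_advice 3.1.5   # prints: any locale
```

The version string itself would come from the first line of
"gawk --version" (for example "GNU Awk 3.1.5"); extracting it is left out
of the sketch.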

I added a note about this to various documentation files.

--Andreas

In message <Pine.LNX.4.63.0710082207370.17151 at lamb.ICSI.Berkeley.EDU> you wrote:
> 
> > My default locale is en_US.  With this locale, I do not see the error David
> > Brodbeck did, even if I use gawk 3.1.5.  If I set LANG=en_US.UTF-8 and use
> > gawk 3.1.5, then I see the error:
> >
> > $ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
> > gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: fatal: 
> > Invalid collation character: /[[:lower:]-?]/
> 
> A followup:
> 
> At home, I'm running gawk 3.1.5 under Fedora Core 3 and my default
> locale is en_US.UTF-8:
> 
>   $ locale
>   LANG=en_US.UTF-8
>   LC_CTYPE="en_US.UTF-8"
>   LC_NUMERIC="en_US.UTF-8"
>   LC_TIME="en_US.UTF-8"
>   LC_COLLATE="en_US.UTF-8"
>   LC_MONETARY="en_US.UTF-8"
>   LC_MESSAGES="en_US.UTF-8"
>   LC_PAPER="en_US.UTF-8"
>   LC_NAME="en_US.UTF-8"
>   LC_ADDRESS="en_US.UTF-8"
>   LC_TELEPHONE="en_US.UTF-8"
>   LC_MEASUREMENT="en_US.UTF-8"
>   LC_IDENTIFICATION="en_US.UTF-8"
>   LC_ALL=
> 
> If I use the default locale, I get the "Invalid collation character" 
> error.  If I set LANG=C, I get the same error.
> 
> If I set LC_ALL=en_US, that error goes away but the make-ngram-pfsg 
> test fails with the message "make-ngram-pfsg: stdout output DIFFERS". 
> I think this is because when LC_ALL is set it overrides the other LC_* 
> variables (http://opengroup.org/onlinepubs/007908799/xbd/envvar.html). 
> This means that the line in test/tests/make-ngram-pfsg/run-test which 
> sets LC_COLLATE=C has no effect when LC_ALL is set.
> 
> If I set LANG=en_US and leave LC_ALL unset, then the
> "Invalid collation character" error goes away and the make-ngram-pfsg
> test passes.
> 
> So it appears that the gawk locale tips in the SRILM INSTALL file may 
> need to be updated to reflect gawk 3.1.5's behavior.  Please let me
> know if there's anything else I could do to help with this.
> 
> Regards,
> David
>
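
The precedence described in the quoted message (LC_ALL overriding
LC_COLLATE) can be sketched as a lookup in shell. The function name
effective_locale is hypothetical; the order LC_ALL, then the category
variable, then LANG follows the POSIX environment variable rules, and the
final fallback to "C" is an assumption, since POSIX leaves that default to
the implementation.

```shell
# Print the locale that applies to one category, e.g. LC_COLLATE.
# Order: LC_ALL, then the category variable itself, then LANG, then "C"
# (the last fallback is assumed; POSIX leaves it implementation-defined).
effective_locale() {
    category=$1
    eval "category_value=\${$category:-}"   # indirect lookup of e.g. $LC_COLLATE
    if [ -n "${LC_ALL:-}" ]; then
        printf '%s\n' "$LC_ALL"
    elif [ -n "$category_value" ]; then
        printf '%s\n' "$category_value"
    elif [ -n "${LANG:-}" ]; then
        printf '%s\n' "$LANG"
    else
        printf '%s\n' "C"
    fi
}

# With LC_ALL set, a test script's LC_COLLATE=C is ignored.
( LC_ALL=en_US; LC_COLLATE=C; effective_locale LC_COLLATE )              # prints en_US

# With LC_ALL unset, LC_COLLATE=C wins over LANG.
( unset LC_ALL; LANG=en_US; LC_COLLATE=C; effective_locale LC_COLLATE )  # prints C
```

This matches the observation that the test only passes when LC_ALL is left
unset, so that the LC_COLLATE=C setting in run-test can take effect.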