SRILM and LC_ALL
gelbart at icsi.berkeley.edu
Mon Oct 8 18:32:33 PDT 2007
On July 19, 2007, Andreas Stolcke wrote:
> David Brodbeck wrote:
> > I'm trying to build SRILM 1.5.2 on Redhat Enterprise Linux Server 5.
> > The machine type is i686_m64. Everything builds all right, but
> > the tests fail for make-ngram-pfsg, ngram-class, and
> > ngram-count-lm-limit-vocab.
> > make-ngram-pfsg is the most obvious one, so I'll tackle that one
> > first. I get the following in the stderr file:
> > gawk: /opt/srilm/bin/i686-m64/add-pauses-to-pfsg:22: fatal: Invalid
> > collation character: /[[:lower:]-ÿ]/
> > Has anyone else run into this? I'm using GNU Awk 3.1.5, and the
> > locale is set to en_US.UTF-8.
> This is odd since we're also using gawk 3.1.5 and I cannot replicate
> the problem even when setting LANG to en_US.UTF-8. It seems that the
> interpretation of gawk regular expressions should not depend on the
> OS release version, but of course there may always be bugs.
Are you sure you used gawk 3.1.5 when you tried to replicate this?
The reason I ask is that at ICSI, the SRILM tools seem to invoke gawk
3.1.3, not gawk 3.1.5:
$ head -1 `which add-pauses-to-pfsg`
$ /usr/bin/gawk --version | head -1
GNU Awk 3.1.3
$ which gawk
$ /usr/local/bin/gawk --version | head -1
GNU Awk 3.1.5
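For anyone who wants to repeat this comparison on their own machine, the check could be wrapped up roughly as follows. This is only a sketch: shebang_interpreter is a hypothetical helper name, not part of SRILM, and it assumes the script begins with a "#!" line that names the interpreter by its full path.

```shell
# Print the interpreter path named on the "#!" line of the script in $1.
# Prints nothing if the first line is not a "#!" line.
shebang_interpreter() {
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Compare the interpreter the SRILM script asks for with the first gawk
# on PATH. This does nothing if add-pauses-to-pfsg is not on PATH.
if script=$(command -v add-pauses-to-pfsg); then
    interp=$(shebang_interpreter "$script")
    echo "interpreter named in script: ${interp:-none found}"
    # If the "#!" line goes through a wrapper such as env, this reports
    # the wrapper's version rather than gawk's.
    if [ -n "$interp" ] && [ -x "$interp" ]; then
        "$interp" --version | head -n 1
    fi
    echo "first gawk on PATH: $(command -v gawk || echo 'not found')"
fi
```

If the two paths differ, the version reported by "gawk --version" at the prompt is not necessarily the version the SRILM scripts run under.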
My default locale is en_US. With this locale, I do not see the error
David Brodbeck did, even if I use gawk 3.1.5. If I set
LANG=en_US.UTF-8 and use gawk 3.1.5, then I see the error:
$ /usr/local/bin/gawk -f `which add-pauses-to-pfsg`
fatal: Invalid collation character: /[[:lower:]-?]/
Setting LC_ALL=C as suggested in the SRILM INSTALL file does not solve
the problem:
tmp$ export LC_ALL=C
tmp$ /usr/local/bin/gawk -f `which add-pauses-to-pfsg`
fatal: Invalid collation character: /[[:lower:]-ÿ]/
The compute-oov-rate script gives a similar error.
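To narrow down which combination of gawk binary and locale triggers the error, the runs above could be scripted roughly as follows. This is a sketch under assumptions: try_locale is a hypothetical helper, the gawk paths are the ones from this message and may differ elsewhere, and the helper only reports whether gawk exits with an error when it loads the script and reads empty input. LC_ALL is used because it overrides LANG and the other LC_* variables.

```shell
# Run the awk binary $1 on script $2 with LC_ALL set to $3, reading
# empty input, and report whether it exited successfully.
try_locale() {
    if LC_ALL="$3" "$1" -f "$2" </dev/null >/dev/null 2>&1; then
        echo "LC_ALL=$3: ok"
    else
        echo "LC_ALL=$3: failed"
    fi
}

# Try each gawk binary with each locale. Run "locale -a" first to
# confirm that the locales listed here are installed.
# This does nothing if add-pauses-to-pfsg is not on PATH.
if script=$(command -v add-pauses-to-pfsg); then
    for awk_bin in /usr/bin/gawk /usr/local/bin/gawk; do
        [ -x "$awk_bin" ] || continue
        for loc in C en_US en_US.UTF-8; do
            echo "$awk_bin $(try_locale "$awk_bin" "$script" "$loc")"
        done
    done
fi
```

To see the actual error message for a failing combination, rerun that one case by hand without the redirection of stderr.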
David Brodbeck, if you're reading this, did setting LC_ALL=C solve
your problem with add-pauses-to-pfsg? This was not clear to me from
reading your July 23 email to Andreas.