SRILM and LC_ALL

David Gelbart gelbart at icsi.berkeley.edu
Mon Oct 8 18:32:33 PDT 2007


On July 19, 2007, Andreas Stolcke wrote:
> David Brodbeck wrote:
> > I'm trying to build SRILM 1.5.2 on Redhat Enterprise Linux Server 5.
> > The machine type is i686_m64.  Everything builds all right, but 
> > the tests fail for make-ngram-pfsg, ngram-class, and
> > ngram-count-lm-limit-vocab.
> >
> > make-ngram-pfsg is the most obvious one, so I'll tackle that one
> > first.  I get the following in the stderr file:
> > gawk: /opt/srilm/bin/i686-m64/add-pauses-to-pfsg:22: fatal: Invalid
> > collation character: /[[:lower:]-ÿ]/
>
> > Has anyone else run into this?  I'm using GNU Awk 3.1.5, and the
> > locale is set to en_US.UTF-8.
>
> This is odd since we're also using gawk 3.1.5 and I cannot replicate 
> the problem even when setting LANG to en_US.UTF-8. It seems that the 
> interpretation of gawk regular expressions should not depend on the 
> OS release version, but of course there may always be bugs.

Hi Andreas,

Are you sure you used gawk 3.1.5 when you tried to replicate this? 
The reason I ask is that at ICSI, the SRILM tools seem to invoke gawk 
3.1.3, not gawk 3.1.5:

$ head -1 `which add-pauses-to-pfsg`
#!/usr/bin/gawk -f
$ /usr/bin/gawk --version | head -1
GNU Awk 3.1.3
$ which gawk
/usr/local/bin/gawk
$ /usr/local/bin/gawk --version | head -1
GNU Awk 3.1.5
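
The same check can be scripted for any of the SRILM scripts. The following is only a sketch: the helper name shebang_interpreter and the temporary demo file are illustrative assumptions, not part of SRILM. It extracts the interpreter path from a script's "#!" line so it can be compared with the first gawk found on PATH.

```shell
# Hypothetical helper: print the interpreter path named on a script's "#!" line.
shebang_interpreter() {
    # Read only the first line, strip the leading "#!" and any whitespace,
    # and keep the first word (the interpreter path, without its flags).
    head -n 1 "$1" | sed -n 's/^#![[:space:]]*\([^[:space:]]*\).*/\1/p'
}

# Stand-in for an SRILM script, so the example is self-contained.
demo=$(mktemp)
printf '#!/usr/bin/gawk -f\nBEGIN { exit }\n' > "$demo"

interp=$(shebang_interpreter "$demo")
echo "shebang interpreter: $interp"
echo "first gawk on PATH:  $(command -v gawk || echo 'not found')"

rm -f "$demo"
```

If the two paths differ, running "$interp" --version and gawk --version separately shows whether the versions differ as well, as they do at ICSI.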

My default locale is en_US.  With this locale, I do not see the error 
David Brodbeck reported, even if I use gawk 3.1.5.  If I set 
LANG=en_US.UTF-8 and use gawk 3.1.5, then I see the error:

$ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: 
fatal: Invalid collation character: /[[:lower:]-?]/

Setting LC_ALL=C as suggested in the SRILM INSTALL file does not solve 
the problem:

$ export LC_ALL=C
$ /usr/local/bin/gawk  -f `which add-pauses-to-pfsg`
gawk: /u/drspeech/src/srilm/devel/bin/i686/add-pauses-to-pfsg:22: 
fatal: Invalid collation character: /[[:lower:]-ÿ]/

The compute-oov-rate script gives a similar error.
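
For anyone wanting to confirm that LC_ALL really does take effect over LANG, here is a minimal sketch of the POSIX precedence for the collation category: LC_ALL first, then LC_COLLATE, then LANG, and finally the "C" default. The function name effective_collate is hypothetical, and it takes the three values as arguments rather than reading the environment so that the demonstration does not change the shell's own locale.

```shell
# Hypothetical helper: given the values of LC_ALL, LC_COLLATE and LANG
# (in that order), print the locale that governs collation under the
# POSIX precedence rules. Empty values count as unset.
effective_collate() {
    if [ -n "$1" ]; then
        echo "$1"        # LC_ALL overrides everything else
    elif [ -n "$2" ]; then
        echo "$2"        # LC_COLLATE applies when LC_ALL is unset
    elif [ -n "$3" ]; then
        echo "$3"        # LANG is the fallback for all categories
    else
        echo "C"         # default when nothing is set
    fi
}

with_lang=$(effective_collate "" "" "en_US.UTF-8")
with_lc_all=$(effective_collate "C" "" "en_US.UTF-8")
echo "LANG only:       $with_lang"
echo "LANG + LC_ALL=C: $with_lc_all"

# Applied to the current environment:
echo "this shell:      $(effective_collate "${LC_ALL:-}" "${LC_COLLATE:-}" "${LANG:-}")"
```

By this rule, LC_ALL=C should select the C locale for collation regardless of LANG, so the error persisting after "export LC_ALL=C" suggests the cause is not simply the exported locale setting. The locale command prints the values the shell actually exports.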

David Brodbeck, if you're reading this, did setting LC_ALL=C solve 
your problem with add-pauses-to-pfsg?  This was not clear to me from 
reading your July 23 email to Andreas.

Thanks,
David

