bug in lattice-tool?

Andreas Stolcke stolcke at speech.sri.com
Wed Nov 8 06:30:00 PST 2006


SRILM uses the strcmp() C library function to compare strings.
I suspect what you're seeing is a function of locale settings
by way of environment variables such as LANG and LC_COLLATE.
This is almost certainly an OS-dependent issue.
First, I would try setting $LANG to "C" and unsetting any of the LC_* variables.

I would write a little test program that invokes strcmp() and 
observe the effect of locale settings on the result.

BTW, I have used SRILM with Spanish, which also has diacritics in
the vocabulary, and it works fine.

--Andreas

In message <20061108082843.47119.qmail at web25401.mail.ukl.yahoo.com> you wrote:
> Andreas,
> 
> We've possibly found a bug in lattice-tool. Here, in
> Brno, we work with the Czech language that has
> diacritized letters. So, lattice-tool does everything
> well with all the calculations until it comes to
> matching of the best path with the reference file to
> get number of del, subs and ins - and finally WER. It
> appears that if both files are in ISO encoding and
> there is a diacritized letter in the reference, it can
> be matched to a non-diacritized word in the output,
> that is actually a different word. So, the WER goes
> down significantly from what it really is (and what is
> correctly output by HResults in HTK).
> 
> best regards,
> Ilya



