tolower option

Andreas Stolcke stolcke at speech.sri.com
Wed Mar 14 10:32:08 PDT 2007


B. Plank wrote:
> Dear SRILM mailing list,
>
> I am wondering.. when I try to train a language model with ngram-count and
> the -tolower option,
> I’m getting the following error:
>
> assertion "i < maxWordLength" failed: file "Vocab.cc", line 97
>
> The input corpus (-text) is an utf8 file. Might this cause the problem?
>
> I am grateful for any suggestion.
>
>   
-tolower is implemented simply by calling the C library's tolower()
function, whose behavior is controlled by the OS's locale settings.
I am not sure whether tolower() works correctly for UTF-8, and if it
does, you probably have to set LC_CTYPE to something appropriate.
In other words, this is all beyond the scope of what the SRILM code
itself handles.

I would write a little test program that calls tolower() on some test 
data to make sure it does what you want.
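
A minimal sketch of such a test might look like the following. It is
not SRILM code: lower_bytes() is a hypothetical helper, and it assumes
the text is lowercased one byte at a time, which is the only way the
standard tolower() can be applied to a char string. It lowercases a
string in place and reports how many bytes changed, so you can compare
the result against what you expect for your data.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical helper (not part of SRILM): lowercase a NUL-terminated
 * string in place by passing each byte to tolower().
 * Returns the number of bytes that were changed. */
size_t lower_bytes(char *s)
{
    size_t changed = 0;

    for (; *s != '\0'; s++) {
        /* tolower() requires a value representable as unsigned char
         * (or EOF), so convert before calling it. */
        unsigned char before = (unsigned char)*s;
        unsigned char after = (unsigned char)tolower(before);

        if (after != before) {
            *s = (char)after;
            changed++;
        }
    }
    return changed;
}
```

A program starts in the default "C" locale, where only the ASCII
letters A-Z are mapped, so the bytes of a multibyte UTF-8 character
such as "Ä" (0xC3 0x84) come back unchanged. To see what your system
does under another locale, call setlocale(LC_CTYPE, "") from <locale.h>
before lower_bytes() and run the program with LC_CTYPE set in the
environment; the outcome then depends on the platform's locale data.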

Andreas




