stolcke at speech.sri.com
Wed Mar 14 10:32:08 PDT 2007
B. Plank wrote:
> Dear SRILM mailing list,
> When I try to train a language model with ngram-count and the
> -tolower option, I get the following error:
> assertion "i < maxWordLength" failed: file "Vocab.cc", line 97
> The input corpus (-text) is a UTF-8 file. Might this cause the problem?
> I am grateful for any suggestion.
-tolower is implemented simply by calling the C library's tolower()
function, whose behavior is controlled by the OS's locale settings.
I am not sure whether tolower() works correctly for UTF-8; if it does,
you probably have to set LC_CTYPE to something appropriate. In other
words, this is all beyond the scope of what the SRILM code itself handles.
I would write a little test program that calls tolower() on some test
data to make sure it does what you want.
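
A minimal sketch of such a test program follows. It is not SRILM code: the function names lower_bytes and lower_wide, the TOLOWER_DEMO_MAIN macro, and the sample string are made up for illustration. It lowercases the same UTF-8 input twice, once byte by byte with tolower() and once through wide characters with towlower(), so the two results can be compared under whatever LC_CTYPE is in effect. The wide-character route is only a point of comparison, and whether it handles your data depends on the locales installed on your system.

```c
#include <ctype.h>
#include <locale.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Lowercase a string in place, one byte at a time, using tolower().
 * The cast to unsigned char avoids passing negative values to tolower(). */
void lower_bytes(char *s)
{
    for (; *s != '\0'; s++) {
        *s = (char)tolower((unsigned char)*s);
    }
}

/* Lowercase via wide characters: decode the multibyte input according to
 * the current LC_CTYPE, apply towlower(), and encode the result again.
 * Returns 0 on success, -1 if a conversion fails or `out` is too small.
 * Inputs longer than 255 characters are truncated (illustration only). */
int lower_wide(const char *in, char *out, size_t outsize)
{
    wchar_t wbuf[256];
    size_t n = mbstowcs(wbuf, in, 255);
    if (n == (size_t)-1) {
        return -1;              /* invalid sequence for this locale */
    }
    wbuf[n] = L'\0';
    for (size_t i = 0; i < n; i++) {
        wbuf[i] = (wchar_t)towlower((wint_t)wbuf[i]);
    }
    size_t m = wcstombs(out, wbuf, outsize);
    if (m == (size_t)-1 || m >= outsize) {
        return -1;              /* not representable, or no room for '\0' */
    }
    return 0;
}

#ifdef TOLOWER_DEMO_MAIN
int main(void)
{
    /* Pick up LC_CTYPE (or LC_ALL / LANG) from the environment. */
    const char *loc = setlocale(LC_CTYPE, "");
    printf("LC_CTYPE locale: %s\n", loc ? loc : "(setlocale failed)");

    /* Sample input: U+00C4 (A with diaeresis) encoded as UTF-8, then ASCII. */
    const char *sample = "\xC3\x84" "PFEL ABC";
    char bytes[64];
    char wide[64];

    strcpy(bytes, sample);
    lower_bytes(bytes);
    printf("tolower() per byte : %s\n", bytes);

    if (lower_wide(sample, wide, sizeof wide) == 0) {
        printf("towlower() per char: %s\n", wide);
    } else {
        printf("wide-character conversion failed under this locale\n");
    }
    return 0;
}
#endif
```

To try it, compile with something like cc -DTOLOWER_DEMO_MAIN test.c and run it under different settings, for example with LC_CTYPE set to a UTF-8 locale that `locale -a` reports on your machine and then with LC_CTYPE=C. If the per-byte line comes back with the non-ASCII character unchanged or corrupted, that tells you what tolower() is doing to your corpus.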