SRILM for Unicode

Andreas Stolcke stolcke at speech.sri.com
Fri Jan 23 09:47:31 PST 2004


I'm not familiar with Unicode, unfortunately.  However, SRILM does
not "interpret" characters other than for parsing lines of text into
words.  It assumes that words are separated by spaces.  So if your Unicode
encoding represents space characters the same way ASCII does, you should be fine.
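
One way to check the space condition for a particular encoding is sketched
below in Python.  This is not SRILM code: split_words is a hypothetical
helper that only mimics splitting a line on the ASCII space byte, and the
sample text is made up.  It relies on the fact that UTF-8 encodes a space as
the single byte 0x20 and never uses that byte inside a multi-byte character,
whereas UTF-16 encodes a space as two bytes.

```python
def split_words(line: bytes) -> list[bytes]:
    """Hypothetical stand-in for a byte-oriented tokenizer:
    split on the ASCII space byte (0x20) and drop empty pieces."""
    return [w for w in line.split(b" ") if w]


text = "\u8a9e\u8a00 \u6a21\u578b caf\u00e9"  # made-up sample: CJK and accented words

# UTF-8: the space is the same single byte as in ASCII.
utf8 = text.encode("utf-8")
words_utf8 = [w.decode("utf-8") for w in split_words(utf8)]

# UTF-16 (little-endian): the space becomes the two bytes 0x20 0x00,
# so a byte-oriented split leaves stray NUL bytes attached to the words.
utf16 = text.encode("utf-16-le")
utf16_has_nul = b"\x00" in utf16

print(words_utf8)
print(utf16_has_nul)
```

Under these assumptions, a corpus stored as UTF-8 meets the condition, while
one stored as UTF-16 would need to be converted first.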

The case mapping functions (-tolower option) in various tools will
probably not work correctly for multi-byte character sets.
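
The kind of problem to expect can be illustrated with a small Python sketch.
It does not reproduce what the SRILM tools actually do (that depends on their
implementation and on the locale); it only assumes a byte-by-byte ASCII case
mapping, which Python's bytes.lower() happens to implement, and compares it
with a Unicode-aware mapping on a made-up word.

```python
word = "\u00c9COLE"  # "ECOLE" with an accented capital E (U+00C9)

# Byte-wise ASCII lowercasing: only the bytes for A-Z are changed, so the
# two-byte UTF-8 sequence for the accented capital is left untouched.
bytewise = word.encode("utf-8").lower().decode("utf-8")

# Unicode-aware lowercasing operates on characters, not bytes.
unicode_aware = word.lower()

print(bytewise)
print(unicode_aware)
print(bytewise == unicode_aware)
```

If case folding matters, one possible workaround (an assumption on my part,
not something the tools require) is to lowercase the corpus with a
Unicode-aware tool before passing it to SRILM, and to leave -tolower off.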

--Andreas

In message <40113180.4030109 at itc.it>you wrote:
> Dear All,
> Is it possible for me to use SRILM on a text corpus that is encoded in
> Unicode?
> Best regards.
> 



