unicode & many files

Andreas Stolcke stolcke at speech.sri.com
Wed Sep 12 10:07:06 PDT 2007


Alexy Khrabrov wrote:
> How good is the unicode support -- e.g. for utf8?  I've fed it some 
> utf8 Cyrillics and it did fine.  How does it know we're using 
> multibyte or single byte characters?
SRILM is oblivious to character sets.  It uses whitespace to delimit 
words, but doesn't analyze them further.  As long as words are separated 
by ASCII whitespace, most functions will work with any character set.
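
As a rough illustration of why this works (a sketch using standard Unix 
tools, not SRILM's actual code), splitting on ASCII space and tab leaves 
UTF-8 words intact, because no byte of a multibyte UTF-8 character 
coincides with an ASCII whitespace byte.  The function name split_words 
is made up for this example.

```shell
# Split input on ASCII space and tab only, printing one token per line.
# LC_ALL=C makes tr treat the input as raw bytes, which mimics a
# tokenizer that knows nothing about character sets.
split_words() {
    LC_ALL=C tr -s ' \t' '\n'
}

# Cyrillic and Latin words separated by spaces and a tab.
printf 'привет  мир\thello\n' | split_words
```

Each of the three words comes out on its own line, with the Cyrillic 
bytes unchanged.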

An exception to the above is the lower-case mapping enabled by the 
-tolower option of various tools.  This requires that your operating 
system knows how to map characters to lowercase via the tolower() 
library function.  This interacts with the locale setting, which is 
typically controlled by environment variables.  But again, this is all 
outside SRILM; it's implemented by the OS and C library functions.
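
To see which locale would apply, the usual POSIX precedence among the 
environment variables can be sketched as below.  This is only an 
illustration: effective_ctype is a made-up helper, the fallback to "C" 
is an assumption about the system default, and the locale and file names 
in the commented invocation are placeholders.

```shell
# Report which locale name governs character classification (LC_CTYPE).
# POSIX precedence: LC_ALL overrides LC_CTYPE, which overrides LANG.
# Empty values count as unset; "C" is assumed as the final fallback.
effective_ctype() {
    printf '%s\n' "${LC_ALL:-${LC_CTYPE:-${LANG:-C}}}"
}

effective_ctype

# Hypothetical invocation with an explicit locale (the locale must be
# installed on the system; names here are placeholders):
# LC_CTYPE=ru_RU.KOI8-R ngram-count -tolower -text corpus.txt -write corpus.counts
```

Note that the C tolower() function works on one byte at a time, so it 
is probably safest not to expect it to lowercase multibyte UTF-8 letters.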
>
> Another question -- how do I feed many text files from a directory, 
> should I do multiple -text options after cooking them somehow, or use 
> -read on an accumulating count file?
You use Unix tools:  

    cat foo/file.* | ngram-count -text - ...

or

    find directory -type f (other options to select the right files) |
        xargs cat | ngram-count -text - ...
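
If file names may contain spaces, a slightly more defensive variant of 
the find pipeline could look like the sketch below.  It assumes a find 
and xargs that support -print0 and -0 (GNU and BSD versions do); the 
helper name cat_tree, the directory, and the '*.txt' pattern are 
placeholders.

```shell
# Concatenate every regular file under directory $1 whose name matches
# the optional pattern $2 (default: all files).  NUL-separated names
# keep paths containing spaces or newlines intact.
cat_tree() {
    find "$1" -type f -name "${2:-*}" -print0 | xargs -0 cat
}

# Usage (corpus and *.txt are placeholders):
# cat_tree corpus '*.txt' | ngram-count -text - -write corpus.counts
```

The order in which find lists files is unspecified, which should not 
matter when the text is only being counted.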

Creating separate count files and then cat-ing them together is also an 
option.
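
A minimal sketch of the count-file route, assuming the -write and -read 
options of ngram-count and that counts for the same N-gram read from 
concatenated files are added together.  NGRAM_COUNT and count_each are 
names invented for this example, and all paths are placeholders.

```shell
# Hypothetical override so a different ngram-count binary can be used.
NGRAM_COUNT=${NGRAM_COUNT:-ngram-count}

# Write one count file per input text file into directory $1.
# Remaining arguments are the text files to count.
count_each() {
    outdir=$1
    shift
    mkdir -p "$outdir"
    for f in "$@"; do
        "$NGRAM_COUNT" -text "$f" -write "$outdir/$(basename "$f").counts"
    done
}

# Usage (paths and model name are placeholders):
# count_each counts foo/file.*
# cat counts/*.counts | ngram-count -read - -lm model.lm
```

Output names are derived from the base name of each input, so inputs 
from different directories that share a name would overwrite each other.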

Andreas





