SRILM transcription format

Andreas Stolcke stolcke at speech.sri.com
Tue Mar 19 15:15:25 PST 2002


Ben,

SRILM does not rely on any fancy transcription conventions.
It tokenizes the input using the strtok() function from the C library.
It doesn't know about XML or any other tagging schemes.

What this boils down to is:

Everything that is separated by whitespace (spaces, tabs, newlines) is
considered a word.  Case distinctions are preserved unless you use the
"-tolower" option, which several of the tools accept.  Punctuation is
treated as just another non-whitespace character, so "Hello," with its
trailing comma is a single word, distinct from "Hello".  You would
therefore have to strip punctuation if you wanted to ignore it in your
modeling, or surround punctuation marks with whitespace if you wanted to
model them as word tokens of their own.
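
The behavior described above can be approximated with a short Python
sketch.  This is only an illustration, not SRILM code: the function names,
the separator set (space, tab, carriage return, newline) and the default
list of punctuation marks are all assumptions made for the example.

```python
import re

# Assumed separator set; chosen to match the description above,
# not taken from the SRILM source.
_SEPARATORS = re.compile(r"[ \t\r\n]+")

# Hypothetical default set of marks to treat as punctuation.
# The apostrophe is left out on purpose so "It's" stays intact.
DEFAULT_MARKS = ".,;:!?"


def tokenize(line, tolower=False):
    """Split on whitespace only; punctuation stays attached to words."""
    if tolower:
        line = line.lower()  # rough analogue of the -tolower option
    return [tok for tok in _SEPARATORS.split(line) if tok]


def strip_punctuation(line, marks=DEFAULT_MARKS):
    """Preprocessing choice 1: delete the marks so they are ignored."""
    return line.translate(str.maketrans("", "", marks))


def separate_punctuation(line, marks=DEFAULT_MARKS):
    """Preprocessing choice 2: pad each mark with spaces so it
    becomes a word token of its own after tokenization."""
    pattern = "([" + re.escape(marks) + "])"
    return re.sub(pattern, r" \1 ", line)


if __name__ == "__main__":
    text = "Hello, world.\tIt's  fine\n"
    print(tokenize(text))
    print(tokenize(strip_punctuation(text)))
    print(tokenize(separate_punctuation(text)))
```

Either preprocessing step would be applied to the training text before it
is handed to the tools; which marks to strip or separate is a modeling
decision.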

--Andreas

In message <000701c1cf99$5416a280$dd00a8c0 at dejima.com> you wrote:
> Andreas,
> 
> Hello, could you point me to a document describing in detail the
> transcription conventions for SRILM tools?
> 
> For example, can words be capitalized? What punctuation is permitted
> (apostrophe? period? comma?)
> 
> Thank you,
> 
> ________________________________
> Ben Reaves      benreaves at ieee.org
> 
> 



