can srilm cope with xml tagged corpora?

Andreas Stolcke stolcke at speech.sri.com
Tue Jan 13 08:27:00 PST 2009


Matt Green wrote:
> I'd like to use srilm to generate bigram counts from the British 
> National Corpus in XML format. I see that the paper
>  "SRILM - An Extensible Language Modeling Toolkit", in Proc. Intl. 
> Conf. Spoken Language Processing, Denver, Colorado, September 2002
> mentions that support for SGML-tagged formats is regarded as 
> desirable: has this support been implemented in the toolkit at this 
> time please?
>
There's been a conscious decision to leave all text processing, 
filtering, conditioning, etc. out of SRILM, as it tends to be too 
application-specific.  So you'll have to use other available tools, or 
your own, to convert SGML to plain ASCII text, with words separated by 
whitespace.
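
As an illustration only (this is not part of SRILM), such a conversion 
could be sketched in Python along the following lines. The sketch assumes 
that each sentence sits in an <s> element, that the words are the text 
inside it, and that the file uses no XML namespace; check these 
assumptions against the corpus markup. The names sentences() and 
convert() are made up for this example.

```python
import xml.etree.ElementTree as ET


def sentences(xml_source, sentence_tag="s"):
    """Yield one whitespace-separated sentence per sentence element.

    xml_source is a filename or a binary file object.  The tag name
    "s" is an assumption about the corpus markup, not something SRILM
    requires.
    """
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == sentence_tag:
            # Every text fragment inside the element becomes one or
            # more tokens; all tags and attributes are dropped.
            words = " ".join(elem.itertext()).split()
            if words:
                yield " ".join(words)
            # Release the element's children to limit memory use.
            elem.clear()


def convert(xml_path, out_path):
    """Write the sentences of xml_path to out_path, one per line."""
    with open(out_path, "w", encoding="utf-8") as out:
        for line in sentences(xml_path):
            out.write(line + "\n")
```

Calling convert("corpus.xml", "corpus.txt") (hypothetical file names) 
would give one sentence per line, which is the kind of input 
ngram-count reads, e.g. 
    ngram-count -order 2 -text corpus.txt -write bigrams.count
Note that the output is UTF-8, so any non-ASCII characters would still 
need separate handling.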

Andreas

> thanks,
> --matt




