Can SRILM cope with XML-tagged corpora?
Andreas Stolcke
stolcke at speech.sri.com
Tue Jan 13 08:27:00 PST 2009
Matt Green wrote:
> I'd like to use srilm to generate bigram counts from the British
> National Corpus in XML format. I see that the paper
> "SRILM - An Extensible Language Modeling Toolkit", in Proc. Intl.
> Conf. Spoken Language Processing, Denver, Colorado, September 2002
> mentions that support for SGML-tagged formats is regarded as
> desirable: has this support been implemented in the toolkit at this
> time please?
>
There's been a conscious decision to leave all text processing,
filtering, conditioning, etc. out of SRILM, as it tends to be too
application-specific. So you'll have to use other available tools or
your own to convert the SGML to a plain ASCII format, with words
separated by whitespace.
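
As a rough illustration of that kind of conversion, here is a minimal
Python sketch. It is not part of SRILM. It assumes the corpus is
well-formed XML without namespaces, that each sentence is wrapped in an
<s> element, and that word and punctuation tokens are <w> and <c>
elements; check those tag names against your copy of the corpus. The
function name and its parameters are made up for this example.

```python
import sys
import xml.etree.ElementTree as ET


def sentences(xml_source, sent_tag="s", token_tags=("w", "c")):
    """Yield each sentence as one line of whitespace-separated tokens.

    The tag names are assumptions about the corpus markup, not
    anything SRILM knows about.
    """
    # iterparse streams the file, so large corpus files need not be
    # parsed into one big tree before any output is produced.
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == sent_tag:
            tokens = [
                t.text.strip()
                for t in elem.iter()
                if t.tag in token_tags and t.text and t.text.strip()
            ]
            if tokens:
                yield " ".join(tokens)
            elem.clear()  # drop the sentence's children to save memory


if __name__ == "__main__":
    # Usage: python strip_tags.py file1.xml file2.xml > corpus.txt
    for path in sys.argv[1:]:
        for line in sentences(path):
            print(line)
```

The output has one sentence per line, which is what ngram-count reads
via -text. Bigram counts could then be produced with something like
"ngram-count -order 2 -text corpus.txt -write bigrams.count". Any
further normalization (case folding, dropping punctuation tokens) is
left out here and depends on the application.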
Andreas
> thanks,
> --matt