[SRILM User List] How to process the disfluency words when building LM

Sun Mar 4 16:56:01 PST 2012

Hello, I am trying to build a language model from transcriptions of
non-native spontaneous speech. However, the corpus contains many
"strange words" because the transcribers tried to transcribe the speech
as closely as possible to the actual pronunciation.

For example, some transcriptions are as follows:

<s> she taught english there and she gave english lesson to a secondary
school students in *boli bolivi bolivia* </s>
<s> *er* what's wrong *er* he asked she asked </s>
<s> her her mother would *em er* her she took her mother in her own house
and the baby *em* *moven bester* </s>

So I would like to ask how I should process these "strange words" that
are not real words, such as boli, bolivi, er, em, moven, bester, etc.
If I replace them with the correct words, the language model will no
longer suit the non-native spontaneous speech task.
If I keep them, their counts and probabilities are too small, and the
dictionary is also hard to generate.

Are there any suggestions?

Thanks!