[SRILM User List] How to process the disfluency words when building LM
Andreas Stolcke
stolcke at icsi.berkeley.edu
Fri Mar 9 16:28:17 PST 2012
On 3/4/2012 4:56 PM, Meng Chen wrote:
> Hello, I am trying to build a language model from transcriptions of
> non-native spontaneous speech. However, there are lots of "strange
> words" in the corpus because the transcriber tried to transcribe as
> closely as possible to the actual pronunciation.
>
> For example, some transcriptions are as follows:
>
> <s> she taught english there and she gave english lesson to a
> secondary school students in *boli bolivi bolivia* </s>
> <s> *er* what's wrong *er* he asked she asked </s>
> <s> her her mother would *em er* her she took her mother in her own
> house and the baby *em* *moven bester* </s>
First, such words are not strange at all; they occur even when native
speakers talk spontaneously.
"er" and "em" are called "filled pauses", and "boli" etc. are "word
fragments". Both belong to a more general class of spontaneous speech
phenomena called "disfluencies". For an overview see
http://www.speech.sri.com/cgi-bin/run-distill?papers/icslp96-dfs-swb.ps.gz .
>
> So I want to ask how I should process these "strange words" that don't
> exist, such as boli, bolivi, er, em, moven, bester, etc.
> If I replace them with the correct words, the language model will be
> unsuitable for the non-native spontaneous speech task.
> If I keep them, their counts and probabilities are too small, and the
> dictionary is also hard to generate.
>
> Are there any suggestions?
Filled pauses are usually modeled like any other word, though you should
normalize their spellings. There are usually just two forms, with and
without a nasal (usually spelled "um" and "uh", respectively). Map
alternative spellings like "ah", "eh", "er", etc. to the standard form
to avoid fragmenting your data. People often use a dedicated vowel phone
for the pronunciations of these words because they are more variable in
quality and duration than the standard schwa phone.
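
As a rough sketch of the spelling normalization described above, the
following Python maps alternative filled-pause spellings to "uh" or "um"
before LM training. The spelling table and the function name are
assumptions for illustration only (nothing here is part of SRILM), and
the table would need to be adapted to your own transcription
conventions, for instance if "ah" or "eh" also occur as ordinary words
in your data.

```python
# Hypothetical spelling table (an assumption, not a standard list):
# adjust it to the conventions used in your own transcripts.
FILLED_PAUSE_MAP = {
    "uh": "uh", "ah": "uh", "eh": "uh", "er": "uh",  # forms without a nasal
    "um": "um", "em": "um", "erm": "um",             # forms with a nasal
}


def normalize_filled_pauses(line):
    """Map alternative filled-pause spellings to 'uh' or 'um'.

    Tokens are split on whitespace; anything not in the table
    (including <s> and </s>) is passed through unchanged.
    """
    return " ".join(FILLED_PAUSE_MAP.get(tok, tok) for tok in line.split())


if __name__ == "__main__":
    print(normalize_filled_pauses("<s> er what's wrong er he asked she asked </s>"))
```

Running this over the training text before counting n-grams keeps all
filled-pause counts pooled under two word types instead of spreading
them over many spellings.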
Fragments, especially short ones, are hard to recognize because they are
very confusable. To start, use a spelling convention that distinguishes
them from full words, usually a final hyphen, e.g., "boli-".
For LM training purposes you might want to delete them entirely, and
represent them with a garbage model in acoustic training to avoid
contaminating the models for regular words.
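
A minimal sketch of the deletion step for LM training text, assuming
fragments have already been marked with a final hyphen as suggested
above (the marking itself has to come from the transcriber or a separate
pass). The function name is hypothetical, and the acoustic-training side
(the garbage model) is not covered here.

```python
def strip_fragments(line):
    """Drop tokens marked as word fragments by a final hyphen (e.g. 'boli-').

    A lone "-" is kept, since it is not a marked fragment.
    """
    kept = [tok for tok in line.split()
            if not (tok.endswith("-") and len(tok) > 1)]
    return " ".join(kept)


if __name__ == "__main__":
    print(strip_fragments("<s> secondary school students in boli- bolivi- bolivia </s>"))
```

Note that this also removes any regular token that happens to end in a
hyphen, so it is only safe if the final hyphen is reserved for
fragments in your transcripts.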
At SRI we tried modeling the most frequent word fragments in both the
acoustic model (AM) and the LM, but even those were not recognized well
(especially because they tend to have just one or two phones), and
removing them from the LM was best for overall word recognition accuracy.
Andreas