[SRILM User List] How to process the disfluency words when building LM

Fri Mar 9 16:28:17 PST 2012

On 3/4/2012 4:56 PM, Meng Chen wrote:
> Hello, I tried to build a language model from some 
> non-native spontaneous speech transcriptions. However, there are lots 
> of "strange words" in the corpus because the transcriber tried to 
> transcribe as close as possible to the real pronunciation.
>
> For example, some transcriptions are as follows:
>
> <s> she taught english there and she gave english lesson to a 
> secondary school students in *boli bolivi bolivia* </s>
> <s> *er* what's wrong *er* he asked she asked </s>
> <s> her her mother would *em er* her she took her mother in her own 
> house and the baby *em* *moven bester* </s>

First, such words are not strange at all; they occur even with native 
speakers in spontaneous speech.
"er" and "em" are called "filled pauses", and "boli" etc. are "word 
fragments". Both belong to a more general class of spontaneous speech 
phenomena called "disfluencies". For an overview, see 
http://www.speech.sri.com/cgi-bin/run-distill?papers/icslp96-dfs-swb.ps.gz .
>
> So I want to ask how should I process these "strange words" that don't 
> exist such as boli, bolivi, er, em, moven, bester etc.
> If I replace them with the correct words, the language model will be 
> unsuitable for the non-native spontaneous speech task.
> If I keep them, their counts and probability are too small. And the 
> dictionary is also hard to generate.
>
> Are there any suggestions?

Filled pauses are usually modeled like any other word, though you should 
normalize their spellings. There are usually just two forms, with and 
without a nasal (usually spelled "um" and "uh", respectively). Map 
alternative spellings like "ah", "eh", "er", etc. to the standard form 
to avoid fragmenting your data. People often use a dedicated vowel phone 
for the pronunciations of these words because they are more variable in 
quality and duration than the standard schwa phone.
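
The spelling normalization could be done with a small preprocessing
script before counting n-grams. The following is only a minimal sketch:
the function name and the variant table are made up for illustration,
and treating "em" as the nasal form is an assumption based on the
examples above. Check the table against your own transcription
conventions, since some of these spellings may be real words in your
data.

```python
# Hypothetical variant table: maps alternative spellings of filled pauses
# to the two standard forms "uh" (no nasal) and "um" (with nasal).
# Mapping "em" to "um" is an assumption; adjust for your corpus.
FILLED_PAUSE_MAP = {
    "ah": "uh",
    "eh": "uh",
    "er": "uh",
    "uh": "uh",
    "em": "um",
    "um": "um",
}


def normalize_filled_pauses(line: str) -> str:
    """Replace known filled-pause spellings in a whitespace-tokenized line."""
    tokens = line.split()
    return " ".join(FILLED_PAUSE_MAP.get(tok, tok) for tok in tokens)


if __name__ == "__main__":
    print(normalize_filled_pauses("<s> er what's wrong er he asked she asked </s>"))
```

Only exact token matches are replaced, so words that merely contain
"er" or "em" (such as "her" or "mother") are left alone.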

Fragments, especially short ones, are hard to recognize because they are 
very confusable. First, you should use a spelling convention that 
distinguishes them from full words, usually a final hyphen, e.g., 
"boli-". For LM training purposes you might want to delete them entirely, 
and represent them with a garbage model in acoustic training to avoid 
contaminating the models for regular words.
At SRI we tried modeling the most frequent word fragments in the AM and 
LM, but even those are not recognized well (especially because they tend 
to have just one or two phones), and removing them from the LM was best 
for overall word recognition accuracy.
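
Once fragments carry the final hyphen, dropping them from the LM
training text is a simple filter. This is a rough sketch with
hypothetical function names, and it assumes the transcripts have
already been re-spelled with the hyphen convention (the original
examples above are not). A bare "-" token is kept, on the assumption
that it is punctuation rather than a fragment.

```python
def is_fragment(token: str) -> bool:
    """True for tokens marked as word fragments by a final hyphen, e.g. "boli-"."""
    return len(token) > 1 and token.endswith("-")


def strip_fragments(line: str) -> str:
    """Remove hyphen-marked fragments from a whitespace-tokenized line."""
    return " ".join(tok for tok in line.split() if not is_fragment(tok))


if __name__ == "__main__":
    print(strip_fragments("<s> students in boli- bolivi- bolivia </s>"))
```

The filtered text would be used for LM training only; the acoustic
training transcripts keep the fragments so they can be mapped to the
garbage model.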

Andreas
