[SRILM User List] build a language model multiword
stolcke at icsi.berkeley.edu
Fri Mar 17 11:54:24 PDT 2017
On 3/17/2017 7:44 AM, Van Tuan MAI wrote:
> Now I have a text file that contains all the words in the story, and a vocab
> file that includes not only normal words but also wrong-pronunciation
> words (a, b(b1, b2), c(c1, c2, c3)). So can I add b1, b2, c1, c2 into
> the N-gram models?
I'm not sure I fully understand your notation (can you give examples of
what b, b1, b2, etc. stand for?), but you can train an LM on "normal"
or "wrong" words as you wish. The software makes no distinction between
the two.
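For example, the training commands are the same whichever words the text and vocabulary contain. A minimal sketch with SRILM's standard tools (the file names here are placeholders, not from the original message):

```shell
# Train a trigram LM on the story text, restricted to the given vocab;
# -unk maps any out-of-vocabulary word to the <unk> token.
ngram-count -order 3 -text story.txt -vocab vocab.txt -unk -lm story.lm

# Evaluate perplexity on held-out text to compare LM variants.
ngram -order 3 -lm story.lm -unk -ppl heldout.txt
```

Comparing perplexity on the same held-out set is one way to run the experiment suggested below.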
You have to experiment to find out if mapping "wrong" to "normal" words
(usually called "text normalization" or TN) would help the performance
of your overall system. The rationale for TN is that it reduces
the sparseness of your data and thereby improves generalization. Also, if
you have a postprocessing step that interprets the words, it might help
to deal only with "normal" words.
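As a concrete sketch of such a mapping step (the mapping file and word forms are assumptions, mirroring the b1 -> b, c1 -> c notation above), wrong-word variants can be rewritten to their normal forms before training:

```shell
# Hypothetical mapping file: one "wrong normal" pair per line.
printf 'b1 b\nb2 b\nc1 c\nc2 c\nc3 c\n' > map.txt

# Toy training text using the same placeholder words.
printf 'a b1 c2 a c3\n' > train.txt

# Apply whole-word replacements (GNU sed; \b is a word boundary).
while read wrong normal; do
  sed -i "s/\\b${wrong}\\b/${normal}/g" train.txt
done < map.txt

cat train.txt
```

The normalized text can then be used for LM training in place of the raw text.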