[SRILM User List] build a language model with multiwords
Andreas Stolcke
stolcke at icsi.berkeley.edu
Fri Mar 17 11:54:24 PDT 2017
On 3/17/2017 7:44 AM, Van Tuan MAI wrote:
> Hello,
>
> I have a text file that contains all the words of a story, and a vocab
> file that includes not only the normal words but also wrong-pronunciation
> words (a, b(b1, b2), c(c1, c2, c3)). So can I add b1, b2, c1, c2 into
> the N-gram model?
>
I'm not sure I fully understand your notation (can you give examples of
what b, b1, b2, etc. stand for?), but you can train an LM on "normal"
or "wrong" words as you wish. The software makes no distinction between
them.
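For instance, a minimal sketch of training a trigram LM with ngram-count where the vocabulary simply lists the variant tokens alongside the normal words (the file names, token list, and smoothing defaults here are placeholders, not taken from the original message):

    import subprocess

    # Hypothetical vocabulary: normal words plus pronunciation-variant tokens,
    # one token per line, in the format SRILM expects for -vocab.
    with open("vocab.txt", "w") as f:
        for token in ["a", "b", "b1", "b2", "c", "c1", "c2", "c3"]:
            f.write(token + "\n")

    # Train a trigram LM; ngram-count treats b1, b2, ... like any other token.
    # Assumes ngram-count is on PATH and corpus.txt holds the story text,
    # one sentence per line.
    subprocess.run([
        "ngram-count",
        "-order", "3",
        "-text", "corpus.txt",
        "-vocab", "vocab.txt",
        "-lm", "variants.lm",
    ], check=True)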
You have to experiment to find out whether mapping "wrong" words to "normal"
words (usually called "text normalization" or TN) helps the performance
of your overall system. The rationale for TN is that it reduces the
sparseness of your data and thereby improves generalization. Also, if
you have a postprocessing step that interprets the words, it might help
to deal only with "normal" words.
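As an illustration of the TN idea, a rough sketch that collapses variant tokens back to their normal forms before training (the mapping and file names are invented for illustration; the real mapping depends on your vocab file):

    # Hypothetical mapping from "wrong" pronunciation variants to "normal" words.
    VARIANT_TO_NORMAL = {
        "b1": "b", "b2": "b",
        "c1": "c", "c2": "c", "c3": "c",
    }

    def normalize_line(line):
        """Map each variant token to its normal form; leave other tokens alone."""
        return " ".join(VARIANT_TO_NORMAL.get(tok, tok) for tok in line.split())

    # Rewrite the training text so the LM only ever sees "normal" words,
    # reducing data sparseness at the cost of losing the variant distinction.
    with open("corpus.txt") as fin, open("corpus.normalized.txt", "w") as fout:
        for line in fin:
            fout.write(normalize_line(line) + "\n")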
Andreas