query regarding usage of SRILM toolkit

Fri Sep 29 09:04:06 PDT 2006

In message <Pine.LNX.4.60.0609291425390.5866 at lantana.tenet.res.in> you wrote:
> 
> Greetings!!!
> 
> We are developing a syllable-based isolated style continuous speech recognizer
> for Indian languages. Currently, our recognizer output is just a sequence of 
> syllables. We want to extract the sequence of words from this syllable sequence
> using statistical language models and a lexicon. I thought maybe one of the 
> programs in this toolkit does something similar (sub-word 
> sequence to word sequence conversion). But all the programs seem to use 
> word lattices.
> 
> Is there any program in this toolkit that extracts the word sequence from 
> the sub-word sequence using an LM and a lexicon?

Lakshmi,

First, you have to remember that when the documentation of a program says
'words' it doesn't mean you have to use words in the conventional sense.
You can use any kind of token (phones, syllables, etc.) in your lattices,
training text, and so on.
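
To illustrate the point, here is a minimal Python sketch (not SRILM code)
of N-gram counting over whitespace-separated tokens. The function name
ngram_counts and the example syllables are made up for illustration; the
point is only that nothing in the counting cares whether a token is a
word or a syllable.

```python
from collections import Counter

def ngram_counts(lines, order=2):
    """Count N-grams over whitespace-separated tokens of any kind."""
    counts = Counter()
    for line in lines:
        # Sentence start/end markers, as is conventional for N-gram models.
        tokens = ["<s>"] + line.split() + ["</s>"]
        for i in range(len(tokens) - order + 1):
            counts[tuple(tokens[i:i + order])] += 1
    return counts

# Hypothetical syllable-level "sentences": each token is a syllable.
syllable_lines = ["na ma ste", "na ma"]
counts = ngram_counts(syllable_lines)
```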

The task you describe sounds like a boundary tagging problem, i.e., given
a sequence of tokens, you want to label each transition between tokens as 
either a "boundary" or a "non-boundary".  There are two tools in SRILM
that can do this, using different kinds of models.  One is 
"hidden-ngram", which performs boundary tagging explicitly.
The other is "disambig", which tags the tokens themselves, not the boundaries
between them.  But by assigning tags that denote "first token in a unit",
"token inside a unit", etc. you can perform boundary tagging implicitly.
(The tokens in your case are the syllables, the units would be the words.)
Both tools use N-gram language models to disambiguate the input.
In your case, the model can be trained from syllabified training data.
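
The two views of the problem can be sketched in Python as follows. This
is only an illustration of the data representations, assuming training
data in which each word is already split into syllables. The boundary
token "<wb>", the tags "B" and "I", and all function names are
hypothetical choices, not something SRILM prescribes; check the manual
pages for the actual input formats each tool expects.

```python
BOUNDARY = "<wb>"  # hypothetical name for the explicit boundary event

def with_boundary_events(words):
    """Explicit view: insert a boundary token between consecutive words."""
    stream = []
    for i, syllables in enumerate(words):
        if i > 0:
            stream.append(BOUNDARY)
        stream.extend(syllables)
    return stream

def with_position_tags(words):
    """Implicit view: tag each syllable as first-in-word (B) or inside (I)."""
    return [(syl, "B" if i == 0 else "I")
            for syllables in words
            for i, syl in enumerate(syllables)]

def words_from_tags(tagged):
    """Recover the word grouping from a tagged syllable sequence."""
    words = []
    for syl, tag in tagged:
        if tag == "B" or not words:
            words.append([syl])      # a B tag opens a new word
        else:
            words[-1].append(syl)    # an I tag continues the current word
    return words

# Hypothetical syllabified training sentence: two words, three syllables each.
sentence = [["na", "ma", "ste"], ["in", "di", "a"]]
stream = with_boundary_events(sentence)
tagged = with_position_tags(sentence)
recovered = words_from_tags(tagged)
```

At recognition time the boundaries or tags are of course unknown; the
language model is what chooses the most likely ones given the syllable
sequence, and the last function shows how a tagged result maps back to
words.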

I suggest you look up papers on "word segmentation", "sentence segmentation",
"Mandarin tokenization", "chunk parsing" and "shallow parsing" to 
get a good idea of the existing models for this type of task,
then study the manual pages for the programs.

--Andreas 

> 
> Thanks in Advance.
> Regards,
> Lakshmi