query regarding usage of SRILM toolkit

Tue Oct 3 22:27:51 PDT 2006

Greetings!!!

Thanks for the prompt reply. But the ideas you mentioned seem to be for
boundary marking when the whole sequence is correct. Our recognition
output is only 50% correct. That is, we have a sequence of syllables of
which only 50% are correct, and from it we need to extract the words. The
n-best results of the recognizer could be used to improve the performance.
We can build a lattice of syllable sequences in which each syllable
position has an n-best list. The task then is to find the best word
sequence from this n-best lattice.
Do you have any programs for a similar task? Please do reply.
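
As an illustration of the task (this is not an SRILM program), a minimal
Python sketch of such a search follows. It assumes that every syllable
position carries an n-best list with log scores, that positions are already
aligned (no inserted or deleted syllables), and that a unigram word model
is enough for the example. The function name, the toy syllables, the
lexicon and all scores are made up for illustration.

```python
import math

def best_word_sequence(slots, lexicon, word_logprob):
    """Viterbi-style search for the best word sequence.

    slots:        list of dicts {syllable: log score}, one per position
    lexicon:      dict {word: tuple of syllables}
    word_logprob: dict {word: unigram log probability}
    Returns (score, words), or (-inf, None) if no segmentation covers
    all positions.
    """
    n = len(slots)
    # best[j] = (score of best path covering slots[0:j], backpointer)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for j in range(1, n + 1):
        for word, syls in lexicon.items():
            i = j - len(syls)
            if i < 0 or best[i][0] == -math.inf:
                continue
            # Every syllable of the word must occur in the matching slot.
            if any(s not in slots[i + k] for k, s in enumerate(syls)):
                continue
            acoustic = sum(slots[i + k][s] for k, s in enumerate(syls))
            score = best[i][0] + word_logprob[word] + acoustic
            if score > best[j][0]:
                best[j] = (score, (i, word))
    if best[n][0] == -math.inf:
        return -math.inf, None
    # Follow the backpointers from the end to recover the words.
    words, j = [], n
    while j > 0:
        i, word = best[j][1]
        words.append(word)
        j = i
    return best[n][0], words[::-1]

# Toy data (hypothetical): three positions, two candidates each.
slots = [
    {"ka": -0.5, "ga": -1.0},
    {"ta": -0.7, "da": -0.7},
    {"ma": -0.4, "na": -1.2},
]
lexicon = {
    "gada": ("ga", "da"), "kata": ("ka", "ta"), "dama": ("da", "ma"),
    "ka": ("ka",), "ma": ("ma",), "na": ("na",),
}
word_logprob = {"gada": -1.0, "kata": -3.0, "dama": -2.5,
                "ka": -2.0, "ma": -1.0, "na": -1.0}

score, words = best_word_sequence(slots, lexicon, word_logprob)
print(words, round(score, 2))
```

In this toy case the search picks the second-best syllable "ga" in the
first position, because the word model prefers "gada" over "kata". A real
system would also have to handle inserted and deleted syllables and would
normally use a higher-order language model.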

Thanks in Advance.
Regards,
Lakshmi

On Fri, 29 Sep 2006, Andreas Stolcke wrote:

>
> In message <Pine.LNX.4.60.0609291425390.5866 at lantana.tenet.res.in> you wrote:
>>
>> Greetings!!!
>>
>> We are developing a syllable-based isolated style continuous speech
>> recognizer for Indian languages. Currently, our recognizer output is just
>> a sequence of syllables. We want to extract the sequence of words from
>> this syllable sequence using statistical language models and a lexicon.
>> I thought maybe one of the programs in this toolkit does something
>> similar (sub-word sequence to word sequence conversion). But all the
>> programs seem to use word lattices.
>>
>> Is there any program in this toolkit that extracts the word sequence from
>> the sub-word sequence using an LM and a lexicon?
>
> Lakshmi,
>
> First, you have to remember that when the documentation of a program says
> 'words' it doesn't mean you have to use words in the conventional sense.
> You can use any kind of token (phones, syllables, etc.) in your lattices
> etc.
>
> The task you describe sounds like a boundary tagging problem, i.e., given
> a sequence of tokens, you want to label each transition between tokens as
> either a "boundary" or a "non-boundary".  There are two tools in SRILM
> that can do this, using different kinds of models.  One is
> "hidden-ngram", which performs boundary tagging explicitly.
> The other is "disambig", which tags the tokens themselves, not the
> boundaries between them.  But by assigning tags that denote "first token
> in a unit", "token inside a unit", etc., you can perform boundary tagging
> implicitly.
> (The tokens in your case are the syllables, the units would be the words.)
> Both tools use ngram language models to disambiguate the input.
> The model can be trained from syllabified training data, in your case.
>
> I suggest you look up papers on "word segmentation", "sentence segmentation",
> "Mandarin tokenization", "chunk parsing" and "shallow parsing" to
> get a good idea of the existing models for this type of task,
> then study the manual pages for the programs.
>
> --Andreas
>
>
>>
>> Thanks in Advance.
>> Regards,
>> Lakshmi
>
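
The implicit tagging scheme described in the quoted reply can be
illustrated with a small Python sketch. It only shows the data preparation
around the tagger: turning word-segmented, syllabified text into
syllable/tag pairs, and turning a tagged syllable sequence back into words.
The tag names "B" (first syllable in a word) and "I" (syllable inside a
word), the function names and the example syllables are assumptions made
for illustration. They are not prescribed by SRILM, and the sketch does not
call any SRILM tool.

```python
def words_to_tags(words):
    """Turn a list of words (each a list of syllables) into
    (syllable, tag) pairs: "B" for the first syllable of a word,
    "I" for every later syllable of the same word."""
    tagged = []
    for word in words:
        for i, syllable in enumerate(word):
            tagged.append((syllable, "B" if i == 0 else "I"))
    return tagged

def tags_to_words(tagged):
    """Group a tagged syllable sequence back into words.
    A "B" tag (or the very first syllable) opens a new word."""
    words = []
    for syllable, tag in tagged:
        if tag == "B" or not words:
            words.append([syllable])
        else:
            words[-1].append(syllable)
    return words

# Hypothetical syllabified training sentence with two words.
sentence = [["va", "nak", "kam"], ["nan", "ri"]]
tagged = words_to_tags(sentence)
print(tagged)
print(tags_to_words(tagged))
```

Tag sequences of the first kind could serve as training material for the
tag-level ngram model, and the second function recovers the words from the
tagger's output.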