query regarding usage of SRILM toolkit

Lakshmi A lakshmi at lantana.tenet.res.in
Tue Oct 3 22:27:51 PDT 2006


Thanks for the prompt reply. But the ideas you mentioned seem to be for
boundary marking when the whole sequence is correct. Our recognition
output is only 50% correct. That is, we have a sequence of syllables that
is only 50% correct, from which we need to extract the words. The n-best
results of the recognizer could be used to improve the performance. We can
have a lattice of syllable sequences in which each syllable has an n-best list.
Now, the task is to find the best word sequence from this n-best lattice.
Do you have any programs for a similar task? Please do reply.
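
To make the task concrete, the search over such an n-best lattice can be sketched in Python. This is only an illustration under strong assumptions, not an SRILM program: the lattice is taken to be a simple chain of slots, each holding the n-best syllables with log-probabilities, with no insertions or deletions; the lexicon, syllables, and scores are invented; and no language model score is applied (one could be added where a word is appended).

```python
import math


def best_word_sequence(slots, lexicon):
    """Find the best-scoring word sequence covering every slot.

    slots:   list of dicts mapping syllable -> log-probability (the n-best list)
    lexicon: dict mapping word -> tuple of syllables (hypothetical entries)
    Returns (score, words), or None if no lexicon path covers all slots.
    """
    n = len(slots)
    # best[i] holds the best (score, words) covering slots[0:i].
    best = [(-math.inf, [])] * (n + 1)
    best[0] = (0.0, [])
    for end in range(1, n + 1):
        for word, syls in lexicon.items():
            start = end - len(syls)
            if start < 0 or best[start][0] == -math.inf:
                continue
            score = best[start][0]
            matched = True
            for k, syl in enumerate(syls):
                log_prob = slots[start + k].get(syl)
                if log_prob is None:  # syllable not among the n-best here
                    matched = False
                    break
                score += log_prob
            # A word n-gram score could be added to `score` at this point.
            if matched and score > best[end][0]:
                best[end] = (score, best[start][1] + [word])
    return best[n] if best[n][0] > -math.inf else None


# Invented example: the top-1 syllable in the fourth slot ("man") is wrong,
# but the lexicon only permits a path through the second-best "nan".
slots = [
    {"va": math.log(0.6), "pa": math.log(0.4)},
    {"nak": math.log(0.5), "nam": math.log(0.5)},
    {"kam": math.log(0.7), "gam": math.log(0.3)},
    {"man": math.log(0.6), "nan": math.log(0.4)},
    {"ri": math.log(0.9), "li": math.log(0.1)},
]
lexicon = {
    "vanakkam": ("va", "nak", "kam"),
    "nanri": ("nan", "ri"),
    "pana": ("pa", "nam"),
}
result = best_word_sequence(slots, lexicon)
```

The dynamic program keeps one best hypothesis per slot position, which is sufficient here only because no language model context is carried; with an n-gram model the state would also need the preceding word(s).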

Thanks in Advance.

On Fri, 29 Sep 2006, Andreas Stolcke wrote:

> In message <Pine.LNX.4.60.0609291425390.5866 at lantana.tenet.res.in> you wrote:
>> Greetings!!!
>> We are developing a syllable-based isolated style continuous speech
>> recognizer for Indian languages. Currently, our recognizer output is just a
>> sequence of syllables. We want to extract the sequence of words from this
>> syllable sequence using statistical language models and a lexicon. I thought
>> maybe one of the programs in this toolkit would be doing something similar
>> (sub-word sequence to word sequence conversion). But all the programs seem
>> to use word lattices.
>> Is there any program in this toolkit that extracts the word sequence from
>> the sub-word sequence using an LM and a lexicon?
> Lakshmi,
> First, you have to remember that when the documentation of a program says
> 'words' it doesn't mean you have to use words in the conventional sense.
> You can use any kind of token (phones, syllables, etc.) in your lattices
> etc.
> The task you describe sounds like a boundary tagging problem, i.e., given
> a sequence of tokens, you want to label each transition between tokens as
> either a "boundary" or a "non-boundary".  There are two tools in SRILM
> that can do this, using different kinds of models.  One is
> "hidden-ngram", which performs boundary tagging explicitly.
> The other is "disambig" which tags the tokens themselves, not the boundaries
> between them.  But by assigning tags that denote "first token in a unit",
> "token inside a unit", etc. you can perform boundary tagging implicitly.
> (The tokens in your case are the syllables, the units would be the words.)
> Both tools use ngram language models to disambiguate the input.
> The model can be trained from syllabified training data, in your case.
> I suggest you look up papers on "word segmentation", "sentence segmentation",
> "Mandarin tokenization", "chunk parsing" and "shallow parsing" to
> get a good idea of the existing models for this type of task,
> then study the manual pages for the programs.
> --Andreas
>> Thanks in Advance.
>> Regards,
>> Lakshmi
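
The implicit boundary tagging described in the quoted reply (labelling each syllable as "first token in a unit" or "token inside a unit") can be sketched as follows. This is a minimal illustration, not part of SRILM: the tag names B and I, the function names, and the example syllables are all assumptions made for the sketch. It shows only how word-segmented training data maps to per-syllable tags and back, not how a tagger chooses the tags.

```python
def tag_syllables(words):
    """Turn a list of words (each a list of syllables) into (syllable, tag) pairs.

    "B" marks the first syllable of a word, "I" marks a syllable inside a word.
    """
    tagged = []
    for word in words:
        for position, syllable in enumerate(word):
            tagged.append((syllable, "B" if position == 0 else "I"))
    return tagged


def group_by_tags(tagged):
    """Rebuild the word grouping from (syllable, tag) pairs."""
    words = []
    for syllable, tag in tagged:
        # Start a new word on "B", or if the sequence begins with "I".
        if tag == "B" or not words:
            words.append([syllable])
        else:
            words[-1].append(syllable)
    return words


# Invented syllabified training example: two words, five syllables.
training_words = [["va", "nak", "kam"], ["nan", "ri"]]
tagged = tag_syllables(training_words)
restored = group_by_tags(tagged)
```

With this encoding, every boundary between tokens is recoverable from the tags alone: a boundary precedes each syllable tagged B.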
