[SRILM User List] Using hidden events

Andreas Stolcke stolcke at icsi.berkeley.edu
Mon Jan 23 09:44:54 PST 2012


On 1/23/2012 3:25 AM, Dmytro Prylipko wrote:
>
>>> .
>>> The questions are:
>>>    -  Should I train the model on the data containing the FPs or on the
>>> clean data?
>> You need to have the FPs in the training data, since (1) they are observed
>> and (2) even hidden events need to be made "unhidden" for training purposes.
>>
>> There is no ready-made training procedure for hidden-event LMs.
>> You yourself have to extract the n-grams that correspond to the events
>> and histories implied by the LM.  For example, if "UH" is a filled pause and
>> the training data has
>>
>>         a b UH c d
>>
>> and you want to train a 3gram LM, you need to generate ngrams
>>
>>         UH      1
>>         b UH    1
>>         a b UH  1
>>         c       1
>>         b c     1
>>         a b c   1
>>         d       1
>>         c d     1
>>         b c d   1
>>
>> and feed that to ngram-count -read plus any of the standard training
>> options.
> Wow, sounds tricky. I guess this procedure is required for those
> disfluencies which are omitted from the context, i.e. marked with the
> -omit option in the hidden vocabulary, but need to be predicted
> themselves. For other kinds, such as insertions, deletions and
> repairs, the LM can be trained just with ngram-count, right?
Well, you need to train a single model for all types of tokens.  So it
is easiest to write a Perl script (for example) that extracts the counts
for all n-grams.

Note that you can write the script so that it processes one sentence at
a time and outputs just a bunch of n-grams, each with count 1.
ngram-count -read will take care of merging and summing the counts.
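
Here is a minimal sketch of such a script (in Perl, with illustrative
details that are not part of SRILM): it assumes trigram order, that "UH"
is the only hidden event, and that it is predicted but omitted from
later histories (the -omit treatment); sentence start/end tags and the
other disfluency types are left out, so treat it as a starting point
rather than a complete training procedure.

    #!/usr/bin/perl
    # hidden-ngrams.pl -- emit hidden-event n-gram counts for ngram-count -read
    # Sketch only: "UH" is assumed to be the sole hidden event, predicted
    # but omitted from the history; no <s>/</s> handling.
    use strict;
    use warnings;

    my $order = 3;        # n-gram order
    my $fp    = "UH";     # filled-pause token (assumed)

    while (my $line = <STDIN>) {
        chomp $line;
        my @clean = ();   # history with the hidden event removed
        foreach my $w (split ' ', $line) {
            # emit n-grams of all orders ending in the current word,
            # drawing the history from the cleaned context
            foreach my $n (1 .. $order) {
                my $hist = $n - 1;
                last if $hist > @clean;
                my @gram = (@clean[scalar(@clean) - $hist .. $#clean], $w);
                print join(" ", @gram), "\t1\n";
            }
            # the filled pause is predicted but never enters the history
            push @clean, $w unless $w eq $fp;
        }
    }

For the example sentence "a b UH c d" this prints the counts shown above
(plus the lower-order counts for the sentence-initial words).  You would
then run something like

    perl hidden-ngrams.pl < train.txt > train.counts
    ngram-count -read train.counts -order 3 -lm hidden.lm

together with your usual discounting options.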

Andreas


