[SRILM User List] Using hidden events
stolcke at icsi.berkeley.edu
Mon Jan 23 09:44:54 PST 2012
On 1/23/2012 3:25 AM, Dmytro Prylipko wrote:
>>> The questions are:
>>> - Should I train the model on the data containing the FPs or on the
>>> clean data?
>> You need to have the FPs in the training data, since (1) they are observed
>> and (2) even hidden events need to be made "unhidden" for training purposes.
>> There is no ready-made training procedure for hidden-event LMs.
>> You yourself have to extract the n-grams that correspond to the events
>> and histories implied by the LM. For example, if "UH" is a filled pause and
>> the training data has
>> a b UH c d
>> and you want to train a 3gram LM, you need to generate ngrams
>> UH 1
>> b UH 1
>> a b UH 1
>> c 1
>> b c 1
>> a b c 1
>> d 1
>> c d 1
>> b c d 1
>> and feed that to ngram-count -read plus any of the standard training
> Wow, sounds tricky. I guess this procedure is required for those
> disfluencies which are omitted from the context, i.e. marked with the
> -omit option in the hidden vocabulary, but need to be predicted
> themselves. For other kinds, such as insertions, deletions and
> repairs, the LM can be trained just with ngram-count, right?
Well, you need to train a single model for all types of tokens. So it
is easiest to write a Perl script (for example) that extracts the counts
for all ngrams.
Note that you can write the script so that it processes one sentence at
a time, and outputs just a bunch of ngrams with count 1.
ngram-count -read will take care of merging and summing the counts.
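
As a rough illustration of the per-sentence script described above, here is a minimal sketch in Python rather than Perl. The function names (hidden_event_ngrams, format_counts) and the `omit` set are hypothetical and only mirror the "a b UH c d" example from this thread. The sketch assumes that omitted events are predicted from the preceding context but never enter the context of later words. It adds no sentence-boundary tokens, and it also emits the short n-grams for the first words of the sentence (a, b, a b), which the quoted list leaves out. The tab before the count is an assumption about the counts-file layout.

```python
def hidden_event_ngrams(tokens, omit, order=3):
    """Return (ngram, 1) pairs for one sentence.

    Tokens listed in `omit` are predicted like any other token,
    but they are never added to the history of later tokens.
    """
    history = []  # context so far, with omitted events left out
    counts = []
    for tok in tokens:
        # Emit all n-grams of length 1..order that end in this token.
        for n in range(1, order + 1):
            ctx_len = n - 1
            if ctx_len > len(history):
                break  # not enough context yet for a longer n-gram
            ctx = history[len(history) - ctx_len:] if ctx_len else []
            counts.append((tuple(ctx + [tok]), 1))
        if tok not in omit:
            history.append(tok)
    return counts


def format_counts(counts):
    """Render counts as text lines: words, a tab, then the count."""
    return ["%s\t%d" % (" ".join(ngram), c) for ngram, c in counts]


# Demo on the sentence from the thread, with UH as the omitted event.
for line in format_counts(hidden_event_ngrams("a b UH c d".split(), {"UH"})):
    print(line)
```

Run on each training sentence in turn, the concatenated output can then be passed to ngram-count -read, which merges and sums duplicate n-grams as noted above.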