[SRILM User List] Using hidden events

Mon Jan 23 03:25:12 PST 2012

On Mon, Jan 23, 2012 at 4:35 AM, Andreas Stolcke
<stolcke at icsi.berkeley.edu> wrote:
> In message <CANskbNPrejDNeRka9hEM+bFbqMF1Kp+4hMOKyb3z1Lq=-RP_9A at mail.gmail.com>
> you wrote:
>> Hi,
>>
>> I would like to use models with a hidden vocabulary for filled pauses (FPs),
>> but I am not sure what the right way to train and test such models is.
>> I have training and test data containing filled pauses between words, as
>> well as 'clean' datasets where the FPs are removed.
>> The filled pauses are going to be modeled as '-observed -omit' or '-observed'
>
> As stated in the ngram(1) man page, filled pauses should normally be
> modeled as -hidden-vocab tokens with -observed -omit.
>
>> .
>> The questions are:
>>   -  Should I train the model on the data containing the FPs or on the
>> clean data?
>
> You need to have the FPs in the training data, since (1) they are observed
> and (2) even hidden events need to be made "unhidden" for training purposes.
>
> There is no ready-made training procedure for hidden-event LMs.
> You yourself have to extract the n-grams that correspond to the events
> and histories implied by the LM.  For example, if "UH" is a filled pause and
> the training data has
>
>        a b UH c d
>
> and you want to train a 3gram LM, you need to generate ngrams
>
>        UH      1
>        b UH    1
>        a b UH  1
>        c       1
>        b c     1
>        a b c   1
>        d       1
>        c d     1
>        b c d   1
>
> and feed that to ngram-count -read plus any of the standard training
> options.
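
A minimal sketch of the extraction described above, assuming every filled
pause is an observed token that is predicted from its context but omitted
from the history of the words that follow it. The function name
omit_counts and its parameters are hypothetical and not part of SRILM.
Unlike the excerpt above, the sketch also emits the n-grams for the words
before the filled pause ("a", "b", "a b"), and it adds no sentence-boundary
tags; whether <s> and </s> should be included is left open here.

```python
from collections import Counter


def omit_counts(sentences, omit_tokens, order=3):
    """Count n-grams in which tokens from omit_tokens are predicted
    but never appear in the history of later tokens."""
    counts = Counter()
    for sentence in sentences:
        history = []  # preceding tokens, with omitted events removed
        for token in sentence.split():
            # Keep at most (order - 1) words of history.
            context = history[-(order - 1):] if order > 1 else []
            # Emit every n-gram of length 1..len(context)+1 ending in token.
            for k in range(len(context) + 1):
                ngram = context[len(context) - k:] + [token]
                counts[" ".join(ngram)] += 1
            # Omitted events do not become part of later histories.
            if token not in omit_tokens:
                history.append(token)
    return counts


if __name__ == "__main__":
    # One "n-gram<TAB>count" line per entry (assumed input layout for
    # ngram-count -read; check it against your SRILM version).
    for ngram, count in omit_counts(["a b UH c d"], {"UH"}).items():
        print(f"{ngram}\t{count}")
```

The printed lines could be written to a file and passed to ngram-count -read
together with the usual discounting options.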

Wow, sounds tricky. I guess this procedure is required for those
disfluencies that are omitted from the context, i.e. marked with the
-omit option in the hidden vocabulary, but need to be predicted
themselves. For other kinds, such as insertions, deletions and
repairs, the LM can be trained just with ngram-count, right?

>
>
>>   - Which vocabulary to use during training and test: with FP or
>> without, since FP word is included into hidden vocabulary?
>
> With FP in training (since there is no "hidden" vocabulary in training,
> see above).
>
> In testing it doesn't matter since all the tokens specified by -hidden-vocab
> are implicitly added to the overall LM vocabulary.
>
>>
>> I am also trying to estimate local perplexity of the words following
>> filled pauses. I extracted these words together with the contexts into
>> separate sentences, e.g:
>> eine woche <FP> was
>> aus vom <FP> sonnabend
>
> You want to use ngram -debug 2 -ppl
> and extract the probabilities from the output.
>
> Andreas
>
>>
>> and applied the trained LM to them. The total perplexity is calculated
>> as 10^( - totalLogProb / N ), where totalLogProb is the sum of the log
>> probabilities of the words predicted after <FP>.
>>
>> The same value is then calculated on these chunks where <FP> has been
>> removed from the context:
>> eine woche was
>> aus vom sonnabend.
>>
>> Is this right?
>>
>> Which setup should I use in order to calculate the local perplexity,
>> when I want to model FPs as hidden events with '-observed -omit'
>> options?
>>
>> Thanks in advance.
>>
>> Yours,
>> Dmytro.
>
>
> --Andreas
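
The local perplexity formula quoted above, 10^( - totalLogProb / N ), can be
sketched as follows. This assumes the per-word log10 probabilities have
already been extracted (for example from the ngram -debug 2 -ppl output) into
(token, logprob) pairs, and that N is the number of words that directly
follow a filled pause. The function name local_perplexity and the numbers in
the example are hypothetical.

```python
def local_perplexity(scored_sentences, fp="<FP>"):
    """Perplexity over only those words that directly follow a filled pause.

    scored_sentences: list of sentences, each a list of
    (token, log10_probability) pairs in sentence order.
    """
    logprobs = []
    for sentence in scored_sentences:
        # Pair every token with its successor and keep the successor's
        # log probability when the predecessor is the filled pause.
        for (prev_token, _), (_, logprob) in zip(sentence, sentence[1:]):
            if prev_token == fp:
                logprobs.append(logprob)
    if not logprobs:
        raise ValueError("no word follows a filled pause")
    return 10 ** (-sum(logprobs) / len(logprobs))


if __name__ == "__main__":
    # Made-up log10 probabilities, for illustration only.
    scored = [
        [("eine", -1.2), ("woche", -0.8), ("<FP>", -1.5), ("was", -2.0)],
        [("aus", -1.1), ("vom", -0.9), ("<FP>", -1.4), ("sonnabend", -1.0)],
    ]
    print(local_perplexity(scored))
```

For the chunks with <FP> removed, the positions of the target words would
have to be recorded before removal, since the marker is no longer there to
locate them.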