[SRILM User List] Using hidden events
dmytro.prylipko at ovgu.de
Mon Jan 23 03:25:12 PST 2012
On Mon, Jan 23, 2012 at 4:35 AM, Andreas Stolcke
<stolcke at icsi.berkeley.edu> wrote:
> In message <CANskbNPrejDNeRka9hEM+bFbqMF1Kp+4hMOKyb3z1Lq=-RP_9A at mail.gmail.com>
> you wrote:
>> I would like to use models with hidden vocabulary for filled pauses,
>> but I am not sure of the right way to train and test such models.
>> I have training and test data containing filled pauses between words,
>> as well as 'clean' datasets where the FPs are removed.
>> The filled pauses are going to be modeled with '-observed -omit' or '-observed'.
> As stated in the ngram(1) man page, filled pauses should normally be
> modeled as -hidden-vocab tokens with -observed -omit.
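(If I read the man page right, the -hidden-vocab file would then contain one
line per hidden event, the token followed by its options, e.g.:

    <FP> -observed -omit

with the token name matching whatever appears in your data.)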
>> The questions are:
>> - Should I train the model on the data containing the FPs or on the
>> clean data?
> You need to have the FPs in the training data, since (1) they are observed
> and (2) even hidden events need to be made "unhidden" for training purposes.
> There is no ready-made training procedure for hidden-event LMs.
> You yourself have to extract the n-grams that correspond to the events
> and histories implied by the LM. For example, if "UH" is a filled pause and
> the training data has
> a b UH c d
> and you want to train a 3gram LM, you need to generate the n-gram counts
> UH 1
> b UH 1
> a b UH 1
> c 1
> b c 1
> a b c 1
> d 1
> c d 1
> b c d 1
> and feed that to ngram-count -read plus any of the standard training options.
Wow, that sounds tricky. I guess this procedure is required for those
disfluencies that are omitted from the context, i.e. marked with the
-omit option in the hidden vocabulary, but still need to be predicted
themselves. For other kinds, such as insertions, deletions and
repairs, the LM can be trained with plain ngram-count, right?
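To make it concrete, here is a minimal sketch (in Python, untested) of that
count-extraction step for a single filled-pause token "UH" treated as
-observed -omit: the FP itself is predicted from the full preceding context,
while later words are predicted from histories that skip the FP. Sentence
boundary tokens (<s>, </s>) are left out for brevity, and the token name is
just a placeholder:

    import sys
    from collections import Counter

    ORDER = 3
    FP = "UH"        # placeholder; use your own filled-pause token
    counts = Counter()

    for line in sys.stdin:              # one training sentence per line
        clean = []                      # running history with FPs omitted
        for w in line.split():
            hist = clean[-(ORDER - 1):]
            # emit every n-gram order ending in w: w; h2 w; h1 h2 w
            for i in range(len(hist) + 1):
                counts[" ".join(hist[len(hist) - i:] + [w])] += 1
            if w != FP:                 # -omit: the FP never enters the history
                clean.append(w)

    for ngram, n in counts.items():
        print(f"{ngram}\t{n}")

On "a b UH c d" this produces exactly the n-grams listed above (plus the ones
for "a" and "b"), in the tab-separated format that ngram-count -read expects,
e.g. ngram-count -order 3 -read counts.txt -kndiscount -lm fp.lm.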
>> - Which vocabulary should I use during training and test: with the FP or
>> without, given that the FP word is included in the hidden vocabulary?
> With FP in training (since there is no "hidden" vocabulary in training,
> see above).
> In testing it doesn't matter since all the tokens specified by -hidden-vocab
> are implicitly added to the overall LM vocabulary.
>> I am also trying to estimate the local perplexity of the words following
>> filled pauses. I extracted these words together with their contexts into
>> separate sentences, e.g.:
>> eine woche <FP> was
>> aus vom <FP> sonnabend
> You want to use ngram -debug 2 -ppl
> and extract the probabilities from the output.
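Something like the following (untested) snippet could pull out the log
probabilities of the words right after <FP> from the -debug 2 output. It
assumes the usual "p( word | context ...) = [ngram] prob [ logprob ]" lines,
which you should double-check against your SRILM version; note that with
-omit the <FP> may not show up in the printed context, so it tracks the
previously predicted word instead:

    import re
    import sys

    # pipe in: ngram -lm fp.lm -hidden-vocab hidden.txt -ppl chunks.txt -debug 2
    logprobs = []
    prev_was_fp = False
    for line in sys.stdin:
        m = re.match(r"\s*p\( (\S+) \|", line)
        if not m:
            continue
        if prev_was_fp:
            # the base-10 log probability is the last [ ... ] number
            nums = re.findall(r"\[ (-?[\d.]+) \]", line)
            if nums:
                logprobs.append(float(nums[-1]))
        prev_was_fp = m.group(1) == "<FP>"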
>> and applied the trained LM to them. The total perplexity is calculated as
>> 10^( - totalLogProb / N ), where totalLogProb is the sum of the log
>> probabilities of the words predicted after <FP>.
>> The same value is then calculated on these chunks where <FP> has been
>> removed from the context:
>> eine woche was
>> aus vom sonnabend.
>> Is this right?
>> Which setup should I use to calculate the local perplexity
>> when I want to model FPs as hidden events with '-observed -omit'?
>> Thanks in advance.
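For what it's worth, with the log probabilities collected as in the snippet
above, the local perplexity as defined there is just:

    def local_ppl(logprobs):
        # 10^( -totalLogProb / N ) over the N words predicted after <FP>
        return 10 ** (-sum(logprobs) / len(logprobs))

and you can compare the value from scoring the chunks with <FP> as a hidden
event (via -hidden-vocab) against the value from scoring the cleaned chunks;
the file names above are placeholders.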