[SRILM User List] Using hidden events

Sun Jan 22 19:35:37 PST 2012

In message <CANskbNPrejDNeRka9hEM+bFbqMF1Kp+4hMOKyb3z1Lq=-RP_9A at mail.gmail.com>
you wrote:
> Hi,
> 
> I would like to use models with hidden vocabulary for filled pauses
> but I am not sure what is the right way to train and test such models.
> I have a train and test data containing filled pauses between words as
> well as 'clean' datasets where FPs are removed.
> The filled pauses are going to be modeled as '-observed -omit' or '-observed'

As stated in the ngram(1) man page, filled pauses should normally be
modeled as -hidden-vocab tokens with -observed -omit.

> .
> The questions are:
>   -  Should I train the model on the data containing the FPs or on the
> clean data?

You need to have the FPs in the training data, since (1) they are observed
and (2) even hidden events need to be made "unhidden" for training purposes.

There is no ready-made training procedure for hidden-event LMs.
You have to extract the n-grams that correspond to the events and
histories implied by the LM yourself.  For example, if "UH" is a filled
pause and the training data has

	a b UH c d

and you want to train a 3gram LM, you need to generate ngrams

	UH	1
	b UH	1
	a b UH	1
	c	1
	b c	1
	a b c	1
	d	1
	c d	1
	b c d	1

and feed that to ngram-count -read plus any of the standard training 
options.
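
The extraction described above could be scripted roughly as follows. This
is only a sketch in Python, not part of SRILM: the function name
hidden_event_ngrams and its parameters are made up for illustration, it
assumes whitespace-tokenized input, and it does not add <s> or </s>
boundary tokens. Each token is counted in the context of the preceding
non-omitted words, and omitted tokens (here "UH") are never added to the
history of the words that follow. Unlike the list above, it also emits the
ordinary n-grams for "a" and "b".

```python
from collections import Counter

def hidden_event_ngrams(tokens, omitted, order=3):
    """Count n-grams of length 1..order, skipping omitted tokens in histories."""
    counts = Counter()
    history = []  # preceding words, with omitted events left out
    for tok in tokens:
        context = history[-(order - 1):] if order > 1 else []
        # Emit the token with every suffix of its context (shortest first).
        for k in range(len(context) + 1):
            ngram = tuple(context[len(context) - k:]) + (tok,)
            counts[ngram] += 1
        # An omitted event is predicted but never becomes part of a history.
        if tok not in omitted:
            history.append(tok)
    return counts

if __name__ == "__main__":
    counts = hidden_event_ngrams("a b UH c d".split(), omitted={"UH"}, order=3)
    for ngram, count in counts.items():
        # One n-gram per line, tab, then the count.
        print(" ".join(ngram) + "\t" + str(count))
```

The printed lines can be saved to a file and passed to ngram-count -read
as described above.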

>   - Which vocabulary to use during training and test: with FP or
> without, since FP word is included into hidden vocabulary?

With FP in training (since there is no "hidden" vocabulary in training,
see above).

In testing it doesn't matter since all the tokens specified by -hidden-vocab 
are implicitly added to the overall LM vocabulary.

> 
> I am also trying to estimate local perplexity of the words following
> filled pauses. I extracted these words together with the contexts into
> separate sentences, e.g:
> eine woche <FP> was
> aus vom <FP> sonnabend

You want to use ngram -debug 2 -ppl 
and extract the probabilities from the output.
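
Once the log10 probabilities of the words of interest have been pulled out
of that output, the local perplexity from the question,
10^(-totalLogProb / N), can be computed as in the sketch below. The
function name local_perplexity is hypothetical, and the extraction of the
values is left to the reader, since it depends on the exact layout of the
-debug 2 lines.

```python
def local_perplexity(log10_probs):
    """Return 10^(-sum(log10_probs) / N) for a non-empty list of log10 probabilities."""
    if not log10_probs:
        raise ValueError("need at least one log probability")
    total = sum(log10_probs)
    return 10.0 ** (-total / len(log10_probs))

if __name__ == "__main__":
    # Made-up log10 probabilities, one per word following a filled pause.
    print(local_perplexity([-1.0, -2.0, -3.0]))
```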

Andreas 

> 
> and applied trained LM on them. Total perplexity is calculated as 10^(
> - totalLogProb / N ), where totalLogProb is the sum of log
> probabilities of the words predicted after <FP>.
> 
> The same value is then calculated on these chunks where <FP> have been
> removed from the context:
> eine woche was
> aus vom sonnabend.
> 
> Is this right?
> 
> Which setup should I use in order to calculate the local perplexity,
> when I want to model FPs as hidden events with '-observed -omit'
> options?
> 
> Thanks in advance.
> 
> Yours,
> Dmytro.

--Andreas