[SRILM User List] Using hidden events

Sun Jan 22 11:19:36 PST 2012

Hi,

I would like to use models with a hidden vocabulary for filled pauses
(FPs), but I am not sure what the right way to train and test such
models is. I have training and test data containing filled pauses
between words, as well as 'clean' versions of the datasets where the
FPs are removed.
The filled pauses are going to be modeled as hidden events with the
'-observed -omit' or '-observed' options.
The questions are:
  - Should I train the model on the data containing the FPs or on the
clean data?
  - Which vocabulary should I use during training and testing: with or
without the FP word, given that the FP word is included in the hidden
vocabulary?

I am also trying to estimate the local perplexity of the words following
filled pauses. I extracted these words together with their contexts into
separate sentences, e.g.:
eine woche <FP> was
aus vom <FP> sonnabend

and applied the trained LM to them. The total perplexity is calculated
as 10^( - totalLogProb / N ), where totalLogProb is the sum of the log
probabilities of the words predicted after <FP>.

The same value is then calculated on the same chunks after <FP> has been
removed from the context:
eine woche was
aus vom sonnabend.
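
To make the calculation concrete, here is a minimal Python sketch of the
formula above. It assumes that N is the number of words scored after <FP>
and that the log probabilities are base 10 (as the 10^ in the formula
suggests). The function name and the numeric values are made up for
illustration; they are not SRILM output.

```python
def local_perplexity(log10_probs):
    """Return 10^(-totalLogProb / N) for a list of base-10 log probabilities.

    Each entry is assumed to be the log10 probability of one word that
    follows <FP> (or of the same word in the FP-free context).
    """
    if not log10_probs:
        raise ValueError("need at least one log probability")
    total_log_prob = sum(log10_probs)
    n = len(log10_probs)  # assumption: N = number of scored words
    return 10 ** (-total_log_prob / n)


# Hypothetical log10 probabilities for "was" and "sonnabend"
with_fp = [-2.0, -3.0]      # scored with <FP> in the context
without_fp = [-2.5, -3.5]   # scored with <FP> removed from the context

ppl_with_fp = local_perplexity(with_fp)
ppl_without_fp = local_perplexity(without_fp)
```

The two values are only comparable if they are computed over the same
set of target words, which is why N has to be identical in both runs.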

Is this right?

Which setup should I use to calculate the local perplexity when I want
to model FPs as hidden events with the '-observed -omit' options?

Thanks in advance.

Yours,
Dmytro.