[SRILM User List] Using hidden events
dmytro.prylipko at ovgu.de
Mon Jan 23 06:23:28 PST 2012
I am conducting experiments on filled pauses, and some of the results are puzzling to me.
I estimated the perplexity of words following filled pauses (FPs) in two ways:
(1) taking FPs into account (FP is modeled as a regular word, not a
hidden event), and (2) after removing them from both the train and test data.
In both cases I count only the log probabilities of the words that follow FPs
(obtained with ngram -debug 2 -ppl), not those of the FPs themselves.
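To be explicit about the computation: a minimal sketch of how I aggregate the selected per-word log10 probabilities (as printed by -debug 2) into a perplexity; the function name and the numbers below are just illustrative, not SRILM output:

```python
import math

def perplexity(log10_probs):
    """Perplexity from a list of per-word log10 probabilities,
    e.g. those printed for individual words by `ngram -debug 2 -ppl`."""
    n = len(log10_probs)
    # ppl = 10 ^ ( - (sum of log10 probs) / number of scored words )
    return 10 ** (-sum(log10_probs) / n)

# Hypothetical log10 probabilities of the words that follow FPs:
after_fp = [-1.73, -2.41, -0.95]
print(perplexity(after_fp))
```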
The first approach provides lower perplexity, which is expected.
But when using -hidden-vocab I get some strange results that I cannot explain.
For example, I would assume that using a language model trained on
'clean' data (i.e., without FPs) together with the hidden-event
specification 'FP -observed -omit' on test data that still contains
pauses (i.e., 'not clean') should give the same result as the
word-only setup (approach (2)), since we predict only words and the
context is freed from disfluencies.
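For reference, the hidden-event specification file I pass via -hidden-vocab contains a single line:

```
FP -observed -omit
```

invoked along the lines of `ngram -lm clean.lm -hidden-vocab hidden.vocab -ppl test.txt -debug 2` (the file names here are placeholders).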
However, the experiments do not support this assumption: using the
'clean' model with the hidden vocabulary on test data containing
pauses gives a much higher perplexity (364 -> 400).
I found that in this case the probability of a word after an FP is
always modeled with unigrams. I conclude that FPs are not omitted from
the context despite the hidden-event instruction. This is supported
by the fact that the result is the same whether I use '-observed
-omit', just '-observed', or just '-omit'.
I also thought that using a model that treats filled pauses as
regular words, combined with a hidden vocabulary containing 'FP
-observed', should not change the result either, since pauses are
then not omitted from the context.
This is not the case either: I get a perplexity of 295 without the
hidden vocabulary and 291 with it.
Also, I found that the perplexity values do not change whether I use
'FP -observed' or just 'FP -omit' in the hidden vocabulary, which
looks strange.
I would greatly appreciate it if you could clarify these questions.