[SRILM User List] OOV terminology
Sander Maijers
S.N.Maijers at student.ru.nl
Wed Jul 3 13:05:44 PDT 2013
On 03-07-13 20:22, Joris Pelemans wrote:
> Hello all,
>
> My question is perhaps a little bit off-topic, but I'm hoping for your
> cooperation, since it's LM-related.
>
> Say we have a training corpus with lexicon V_train. Since some of the
> words have near-zero counts, we choose to exclude them from our LM. This
> gives us a new lexicon, let's call it V_final. However, this also gives
> us two types of OOV words: those not in V_train and those in V_train
> but not in V_final. I was wondering whether there are standard terms in the
> literature for these two types of OOVs. I have read my share of papers,
> but none of them seem to make this distinction.
>
> Kind regards,
>
> Joris
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
Hi Joris,
In my view, the vocabulary is a superset of the actual set of wordforms
for which all wordform sequences (the N-permutations of vocabulary
words, with repetition) are modeled in the N-gram LM.
What limits the hypothesized transcript produced by an ASR system is
the intersection of two sets:
a. the wordforms in the pronunciation lexicon (the mapping between
acoustic feature sequences and orthographic representations)
b. the target words of the wordform sequences in the LM (as opposed to
history words)
The vocabulary does not matter then: it is just an optional means to
constrain the potential richness (given the written training data) of an
N-gram LM that you are creating. You can use a vocabulary as a
constraint ('-limit-vocab' in 'ngram-count'), and/or use it to
preprocess the training data with special tokens that aren't really
words (such as "<unk>" or a 'proper name class' token).
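Conceptually, that preprocessing amounts to replacing every token
outside the chosen vocabulary before counting. A minimal Python sketch
of the idea (not using SRILM itself; the function name 'map_oov' is
made up for illustration), roughly what 'ngram-count -vocab ... -unk'
does internally:

```python
def map_oov(tokens, vocab, unk="<unk>"):
    """Replace every token that is not in the vocabulary with the unk token."""
    return [tok if tok in vocab else unk for tok in tokens]

# Toy vocabulary and sentence, chosen only to illustrate the mapping:
vocab = {"the", "cat", "sat"}
sentence = "the cat sat on the mat".split()
print(map_oov(sentence, vocab))
# "on" and "mat" are out of vocabulary, so both are mapped to "<unk>"
```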
So, the vocabulary may contain superfluous words. Once you accept that
this is not an issue in itself, you can take it a step further: after
you have created and pruned an LM, you can find out which words in your
vocabulary were actually redundant given the written training data you
used to create that LM, and you could just as well have dropped those
words from the vocabulary before creating the LM. Maybe that reduces
the size of your vocabulary as much as you hope. Will this be
worthwhile? Not for the ASR task, you see.
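As a toy illustration of that comparison (the data here is made up; in
practice you would extract the unigrams from the LM file itself):

```python
# Compare the words the pruned LM actually ended up containing with the
# vocabulary you started from. 'lm_words' stands in for the unigrams you
# would read out of, e.g., the ARPA file of the pruned LM.
vocabulary = {"the", "cat", "sat", "mat", "xylophone"}
lm_words = {"the", "cat", "sat", "mat"}  # hypothetical contents of the pruned LM

redundant = vocabulary - lm_words  # words you could have dropped beforehand
print(sorted(redundant))
```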
The term OOV comes in handy as shorthand for words that are in the
written training data but not in the vocabulary. It is not precise,
though; you could just as well use an element-out-of-set notation
(short and clear) in reports. Maybe you have read the article
"Detection of OOV Words Using Generalized Word Models and a Semantic
Class Language Model" by Schaaf, which was a top Google result for me.
This author confuses the pronunciation lexicon with the vocabulary.
While you can, confusingly, call a word 'OOV' when it was not
transcribed correctly because, for one, it was not modeled by the
pronunciation lexicon, I think it is not okay to conflate the concepts
of vocabulary and pronunciation lexicon as he does.
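For what it's worth, the element-out-of-set notation suggested above
could distinguish Joris's two OOV types like this (one possible
notation, using his sets V_train and V_final):

```latex
% a word never seen in the written training data at all:
w \notin V_{\mathrm{train}}
% a word seen in training but excluded from the final LM vocabulary:
w \in V_{\mathrm{train}} \setminus V_{\mathrm{final}}
```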
I hope this clears up any confusion?