[SRILM User List] OOV terminology
Sander Maijers
S.N.Maijers at student.ru.nl
Wed Jul 3 13:05:44 PDT 2013
On 03-07-13 20:22, Joris Pelemans wrote:
> Hello all,
>
> My question is perhaps a little bit off-topic, but I'm hoping for your
> cooperation, since it's LM-related.
>
> Say we have a training corpus with lexicon V_train. Since some of the
> words have near-zero counts, we choose to exclude them from our LM. This
> gives us a new lexicon, let's call it V_final. However, this also gives
> us two types of OOV words: those not in V_train and those in V_train
> but not in V_final. I was wondering whether there are standard terms in the
> literature for these two types of OOVs. I have read my share of papers,
> but none of them seem to make this distinction.
>
> Kind regards,
>
> Joris
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
Hi Joris,
In my view, the vocabulary is a superset of the actual set of wordforms
for which all wordform sequences (the N-permutations of vocabulary
words, with repetition) are modeled in the N-gram LM.
What limits the hypothesized transcript produced by an ASR system is
the intersection of two sets:
a. the wordforms in the pronunciation lexicon (the mapping between
acoustic feature sequences and orthographic representations)
b. the target words of the wordform sequences in the LM (as opposed to
history words)
The vocabulary does not matter then: it is just an optional means to
constrain the potential richness (given the written training data) of an
N-gram LM that you are creating. You can use a vocabulary as a
constraint ('-limit-vocab' in 'ngram-count'), and/or use it to
preprocess the training data with special tokens that aren't really
words (such as "<unk>" or a 'proper name class' token).
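Conceptually, that preprocessing amounts to replacing every token
outside the chosen vocabulary before counting. A minimal Python sketch
of the idea (not using SRILM itself; the function name 'map_oov' is
made up for illustration), roughly what 'ngram-count -vocab ... -unk'
does internally:

```python
def map_oov(tokens, vocab, unk="<unk>"):
    """Replace every token that is not in the vocabulary with the unk token."""
    return [tok if tok in vocab else unk for tok in tokens]

# Toy vocabulary and sentence, chosen only to illustrate the mapping:
vocab = {"the", "cat", "sat"}
sentence = "the cat sat on the mat".split()
print(map_oov(sentence, vocab))
# "on" and "mat" are out of vocabulary, so both are mapped to "<unk>"
```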
So, the vocabulary may contain superfluous words. Once you accept that
this is not an issue in itself, you can take it a step further: after
you have created and pruned an LM, you can find out which words in your
vocabulary were actually redundant given the written training data you
used to create that LM, and you could just as well have dropped those
words from the vocabulary before creating the LM. Maybe that reduces
the size of your vocabulary as much as you hope. Will this be
worthwhile? Not for the ASR task, you see.
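As a toy illustration of that comparison (the data here is made up; in
practice you would extract the unigrams from the LM file itself):

```python
# Compare the words the pruned LM actually ended up containing with the
# vocabulary you started from. 'lm_words' stands in for the unigrams you
# would read out of, e.g., the ARPA file of the pruned LM.
vocabulary = {"the", "cat", "sat", "mat", "xylophone"}
lm_words = {"the", "cat", "sat", "mat"}  # hypothetical contents of the pruned LM

redundant = vocabulary - lm_words  # words you could have dropped beforehand
print(sorted(redundant))
```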
The term OOV comes in handy as shorthand for words that are in the
written training data but not in the vocabulary. It is not precise,
though; you could just as well use an element-out-of-set notation
(short and clear) in reports. Maybe you have read the article
"Detection of OOV Words Using Generalized Word Models and a Semantic
Class Language Model" by Schaaf, which was a top Google result for me.
This author confuses the pronunciation lexicon with the vocabulary.
While you can, confusingly, call a word 'OOV' when it was not
transcribed correctly because, for one, it was not modeled by the
pronunciation lexicon, I think it is not okay to conflate the concepts
of vocabulary and pronunciation lexicon as he does.
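For what it's worth, the element-out-of-set notation suggested above
could distinguish Joris's two OOV types like this (one possible
notation, using his sets V_train and V_final):

```latex
% a word never seen in the written training data at all:
w \notin V_{\mathrm{train}}
% a word seen in training but excluded from the final LM vocabulary:
w \in V_{\mathrm{train}} \setminus V_{\mathrm{final}}
```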
I hope this clears up any confusion?