[SRILM User List] OOV terminology

Sander Maijers S.N.Maijers at student.ru.nl
Wed Jul 3 14:30:18 PDT 2013


On 03-07-13 23:05, Joris Pelemans wrote:
> Sander,
>
> Thank you for your elaborate reply, but it doesn't really answer my
> question. I am not confused about the different sets of words. I know
> why they are there and what they are used for, but I'm wondering whether
> there is a standard term to denote each set individually. Let me
> rephrase my question with a very simple example:
>
> Given a single training sentence, "wrong is wrong" and a language model
> with cut-off 1, what are the terms to denote the following sets:
>
>  1. {wrong, is}?
>  2. {wrong}?
>  3. {is}?
>  4. all other English words?
>
> I am especially interested in terms that differentiate between sets 3
> and 4, if such terms exist.
>
> Regards,
>
> Joris

My response was an attempt to clear up your confusion as it appeared to 
me from what you wrote about V_final. Such a vocabulary does not simply 
exist unless you construct one explicitly; you only mentioned excluding 
words from an LM.

I am confident that there are no established terms for those sets of 
hypothetical vocabularies you list. You can of course give them names 
and describe their meaning, like the vocabulary 
V_n = { w \in V_train | C(w) > n }, where C is a function that counts 
the number of times a word occurs in the written training data; see the 
sketch below. But do you have an opinion as to why such terms would be 
needed, and why they would be better than a definition like the 
previous one?
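
To make that concrete, here is a minimal sketch in Python of such a 
definition, applied to the "wrong is wrong" example from your mail with 
cut-off n = 1. The names V_train and V_final follow your own notation; 
the classify helper is purely illustrative:

    from collections import Counter

    def cutoff_vocab(tokens, n):
        # V_n = { w in V_train | C(w) > n }, where C counts how often
        # w occurs in the written training data.
        counts = Counter(tokens)
        return {w for w, c in counts.items() if c > n}

    tokens = "wrong is wrong".split()
    V_train = set(tokens)              # your set 1: {'wrong', 'is'}
    V_final = cutoff_vocab(tokens, 1)  # your set 2: {'wrong'}

    def classify(word):
        # Separates your sets 3 and 4 by plain set membership.
        if word in V_final:
            return "in final vocabulary"
        if word in V_train:
            return "seen in training, below cut-off"  # set 3
        return "never seen in training"               # set 4

    print(classify("wrong"), classify("is"), classify("cat"))

As the last function shows, the distinction between your sets 3 and 4 
falls out of ordinary set membership tests, which is why I doubt 
dedicated terms are needed.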



> On 07/03/13 22:05, Sander Maijers wrote:
>> On 03-07-13 20:22, Joris Pelemans wrote:
>>> Hello all,
>>>
>>> My question is perhaps a little bit off topic, but I'm hoping for
>>> your cooperation, since it's LM related.
>>>
>>> Say we have a training corpus with lexicon V_train. Since some of the
>>> words have near-zero counts, we choose to exclude them from our LM. This
>>> gives us a new lexicon, let's call it V_final. However, this also gives
>>> us two types of OOV words: those not in V_train and those not in
>>> V_final. I was wondering whether there are standard terms in the
>>> literature for these two types of OOVs. I have read my share of papers,
>>> but none of them seem to make this distinction.
>>>
>>> Kind regards,
>>>
>>> Joris
>>> _______________________________________________
>>> SRILM-User site list
>>> SRILM-User at speech.sri.com
>>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>>
>> Hi Joris,
>>
>> In my view the vocabulary is a superset of the actual set of
>> wordforms for which all wordform sequences (the N-permutations of
>> vocabulary words, with repetition) are modeled in the N-gram LM.
>>
>> What limits the hypothesized transcript produced by an ASR system is
>> the intersection between the sets of:
>> a. the wordforms in the pronunciation lexicon (the mapping between
>> acoustic feature sequences and orthographic representations)
>> b. the target words of the wordform sequences in the LM (as opposed to
>> history words)
>>
>> The vocabulary does not matter then: it is just an optional means to
>> constrain the potential richness (given the written training data) of
>> an N-gram LM that you are creating. You can use a vocabulary as a
>> constraint ('-limit-vocab' in 'ngram-count'), and/or use it to
>> facilitate a preprocessed form of training data by means of special
>> tokens that aren't really words (such as "<unk>" or a 'proper name
>> class' token).
>>
>> So, the vocabulary may contain superfluous words, and that in itself
>> is not an issue. Once you realize this, you could take it further:
>> after you have created and pruned an LM, you can find out which words
>> in your vocabulary were actually redundant given the written training
>> data you used to create that LM, and you could just as well drop
>> those words from the vocabulary you had before creating your LM.
>> Maybe that reduces the size of your vocabulary as much as you hope.
>> Will this be worthwhile? Not for the ASR task.
>>
>> The term OOV comes in handy as shorthand for words that are in the
>> written training data but not in the vocabulary. It is not precise;
>> you could just as well use an element-out-of-set notation (short and
>> clear) in reports. Maybe you have read the article "Detection of OOV
>> Words Using Generalized Word Models and a Semantic Class Language
>> Model" by Schaaf, which was a top Google result for me. That author
>> conflates the pronunciation lexicon with the vocabulary. While you
>> can, confusingly, call a word 'OOV' when it was not transcribed
>> correctly because, for instance, it was not modeled by the
>> pronunciation lexicon, I think it is not okay to conflate the
>> concepts of vocabulary and pronunciation lexicon as he does.
>>
>> I hope this clears up any confusion?
>>
>>
>


