[SRILM User List] OOV terminology

Joris Pelemans Joris.Pelemans at esat.kuleuven.be
Wed Jul 3 14:05:30 PDT 2013


Sander,

Thank you for your elaborate reply, but it doesn't really answer my 
question. I am not confused about the different sets of words. I know 
why they are there and what they are used for, but I'm wondering whether 
there is a standard term to denote each set individually. Let me 
rephrase my question with a very simple example:

Given a single training sentence, "wrong is wrong", and a language model
with a count cut-off of 1, what are the terms to denote the following sets:

 1. {wrong, is}?
 2. {wrong}?
 3. {is}?
 4. all other English words?

I am especially interested in terms that differentiate between sets 3 
and 4, if such terms exist.
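
For concreteness, here is a tiny Python sketch, just to pin down what I
mean by cut-off 1 (i.e. words occurring at most once are excluded from
the final vocabulary, as in the example above):

    from collections import Counter

    counts = Counter("wrong is wrong".split())
    cutoff = 1
    v_train = set(counts)                                   # set 1: {wrong, is}
    v_final = {w for w, c in counts.items() if c > cutoff}  # set 2: {wrong}
    dropped = v_train - v_final                             # set 3: {is}
    # set 4: any English word not in v_train at all (never seen in training)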

Regards,

Joris


On 07/03/13 22:05, Sander Maijers wrote:
> On 03-07-13 20:22, Joris Pelemans wrote:
>> Hello all,
>>
>> My question is perhaps a little off topic, but I'm hoping for your
>> cooperation, since it's LM-related.
>>
>> Say we have a training corpus with lexicon V_train. Since some of the
>> words have near-zero counts, we choose to exclude them from our LM. This
>> gives us a new lexicon, let's call it V_final. However, this also gives
>> us two types of OOV words: those not in V_train, and those in V_train
>> but not in V_final. I was wondering whether there are standard terms in the
>> literature for these two types of OOVs. I have read my share of papers,
>> but none of them seem to make this distinction.
>>
>> Kind regards,
>>
>> Joris
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
> Hi Joris,
>
> In my view, the vocabulary is a superset of the actual set of
> wordforms for which all wordform sequences (the N-permutations of
> vocabulary words, with repetition) are modeled in the N-gram LM.
>
> What limits the hypothesized transcript produced by an ASR system is
> the intersection of two sets:
> a. the wordforms in the pronunciation lexicon (the mapping between
> acoustic feature sequences and orthographic representations), and
> b. the target words of the wordform sequences in the LM (as opposed to
> history words).
>
> The vocabulary does not matter then: it is just an optional means to
> constrain the potential richness (given the written training data) of
> the N-gram LM that you are creating. You can use a vocabulary as a
> constraint ('-limit-vocab' in 'ngram-count'), and/or use it to
> facilitate a preprocessed form of the training data by means of special
> tokens that aren't really words (such as "<unk>" or a 'proper name
> class' token).
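>
> To make that concrete, here is a minimal sketch in Python (not SRILM
> itself; the file names are placeholders) that maps every word outside
> a fixed word list to "<unk>" before counting, which is essentially
> what such a vocabulary constraint amounts to:
>
>     # vocab.txt: one whitespace-separated word per entry (hypothetical file)
>     vocab = set(open("vocab.txt").read().split())
>     with open("train.txt") as src, open("train.unk.txt", "w") as dst:
>         for line in src:
>             tokens = [w if w in vocab else "<unk>" for w in line.split()]
>             dst.write(" ".join(tokens) + "\n")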
>
> So, the vocabulary may contain superfluous words, and that in itself
> is not an issue. Once you accept that, you could take it a step
> further: after you have created and pruned an LM, you can find out
> which words in your vocabulary were actually redundant given the
> written training data you used to create that LM, and you could just
> as well have dropped those words from the vocabulary you had before
> creating the LM. Maybe that reduces the size of your vocabulary as
> much as you hope. Will this be worthwhile? Not for the ASR task, as
> explained above.
>
> The term OOV comes in handy as shorthand for words that are in the
> written training data but not in the vocabulary. It is not precise;
> you could just as well use an element-out-of-set notation (short and
> clear) in reports. Maybe you have read the article "Detection of OOV
> Words Using Generalized Word Models and a Semantic Class Language
> Model" by Schaaf, which was a top Google result for me. That author
> confuses the pronunciation lexicon with the vocabulary. While you can,
> confusingly, call a word 'OOV' when it was not transcribed correctly
> because, for instance, it was not modeled by the pronunciation
> lexicon, I think it is not okay to conflate the concepts of vocabulary
> and pronunciation lexicon as he does.
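>
> (In the notation of your original message, a word dropped by the count
> cut-off would be something like "w ∈ V_train, w ∉ V_final", while a
> word never seen in training would simply be "w ∉ V_train".)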
>
> I hope this clears up any confusion?
>
>
