[SRILM User List] OOV terminology

Long Qin lqin at cs.cmu.edu
Wed Jul 3 13:35:38 PDT 2013


Hi Joris,

As far as I know, there is no standard term that distinguishes OOV
words that appear in the LM training data but were removed by a count
cutoff from OOV words that never appear in the data.

Generally, the vocabulary of a recognizer is the intersection of the
words in its lexicon and its LM. From that point of view, the two types
of OOVs have the same effect on recognition: the ASR system cannot
recognize them. For OOV word detection, however, it is normally easier
to detect OOV words that appear in the training text but not in the
vocabulary, because we know the pronunciation of those words and we know
where in a sentence they may appear.
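
To make the distinction concrete, here is a minimal Python sketch. It
assumes a simple count cutoff as described in this thread (words seen
fewer than min_count times are dropped). The labels "cutoff OOV" and
"unseen OOV" and the function names are made up for illustration; they
are not standard terms and this is not SRILM code.

```python
from collections import Counter


def build_vocabs(train_tokens, min_count):
    """Return (v_train, v_final).

    v_train: every word type seen in the training text.
    v_final: word types seen at least min_count times (assumed cutoff rule).
    """
    counts = Counter(train_tokens)
    v_train = set(counts)
    v_final = {w for w, c in counts.items() if c >= min_count}
    return v_train, v_final


def classify_oov(token, v_train, v_final):
    """Label a token with one of three hypothetical categories."""
    if token in v_final:
        return "in-vocabulary"
    if token in v_train:
        # Seen in training, but removed by the count cutoff.
        return "cutoff OOV"
    # Never seen in the training text at all.
    return "unseen OOV"


train = "the cat sat on the mat the cat ran".split()
v_train, v_final = build_vocabs(train, min_count=2)
for word in ["the", "mat", "dog"]:
    print(word, "->", classify_oov(word, v_train, v_final))
```

For the first type we still have training counts and contexts, which is
what makes those words easier to detect.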

Thanks,
Long

On Wed, July 3, 2013 4:18 pm, yangyang shi wrote:
> Hi Joris,
>
>
> Is this a type of cut-off? If you set cut-off == 3, that means words
> that occur fewer than 3 times will be considered OOV.
>
> Cheers,
>
>
> Yangyang Shi
>
>
>
> On Wed, Jul 3, 2013 at 8:22 PM, Joris Pelemans <
> Joris.Pelemans at esat.kuleuven.be> wrote:
>
>
>> Hello all,
>>
>>
>> My question is perhaps a little bit off topic, but I'm hoping for your
>> cooperation, since it's LM related.
>>
>> Say we have a training corpus with lexicon V_train. Since some of the
>> words have near-zero counts, we choose to exclude them from our LM. This
>> gives us a new lexicon, let's call it V_final. However, this also gives
>> us two types of OOV words: those not in V_train and those not in
>> V_final. I was wondering whether there are standard terms in the
>> literature for these two types of OOVs. I have read my share of papers,
>> but none of them seem to make this distinction.
>>
>> Kind regards,
>>
>>
>> Joris
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>>
>
>
>
> --
> Met vriendelijke groet,
>
>
> Yangyang Shi
>
>
> TU Delft / Interactive Intelligence Group
> HB12.290, EWI,
> Mekelweg 4,
> 2628 CD Delft,
> T +31 (0) 152782549
> E shiyang1983 at gmail.com; yangyangshi at ieee.org



