[SRILM User List] OOV terminology
Joris Pelemans
Joris.Pelemans at esat.kuleuven.be
Wed Jul 3 14:05:30 PDT 2013
Sander,
Thank you for your elaborate reply, but it doesn't really answer my
question. I am not confused about the different sets of words. I know
why they are there and what they are used for, but I'm wondering whether
there is a standard term to denote each set individually. Let me
rephrase my question with a very simple example:
Given a single training sentence, "wrong is wrong" and a language model
with cut-off 1, what are the terms to denote the following sets:
1. {wrong, is}?
2. {wrong}?
3. {is}?
4. all other English words?
I am especially interested in terms that differentiate between sets 3
and 4, if such terms exist.
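
To make the four sets concrete, here is a minimal Python sketch. It assumes that "cut-off 1" means every word with a training count of 1 or less is dropped from the LM; the function names `split_vocabulary` and `classify` and the string labels are hypothetical, chosen only for illustration, and are not standard terminology.

```python
from collections import Counter

def split_vocabulary(sentences, cutoff):
    """Split training words into the sets discussed above.

    Assumption: a word survives the cut-off only if count > cutoff.
    """
    counts = Counter(word for s in sentences for word in s.split())
    v_train = set(counts)                                    # set 1
    v_final = {w for w, c in counts.items() if c > cutoff}   # set 2
    dropped = v_train - v_final                              # set 3
    return v_train, v_final, dropped

def classify(word, v_train, v_final):
    """Label a word by which set it falls in (labels are illustrative)."""
    if word in v_final:
        return "in LM vocabulary"
    if word in v_train:
        return "seen in training, removed by cut-off"        # set 3
    return "never seen in training"                          # set 4

v_train, v_final, dropped = split_vocabulary(["wrong is wrong"], cutoff=1)
```

With the example sentence, `v_train` is {wrong, is}, `v_final` is {wrong}, and `dropped` is {is}; any other word, such as "right", falls into set 4.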
Regards,
Joris
On 07/03/13 22:05, Sander Maijers wrote:
> On 03-07-13 20:22, Joris Pelemans wrote:
>> Hello all,
>>
>> My question is perhaps a little bit off topic, but I'm hoping for your
>> cooperation, since it's LM related.
>>
>> Say we have a training corpus with lexicon V_train. Since some of the
>> words have near-zero counts, we choose to exclude them from our LM. This
>> gives us a new lexicon, let's call it V_final. However this also gives
>> us two types of OOV words: those not in V_train and those not in
>> V_final. I was wondering whether there are standard terms in the
>> literature for these two types of OOVs. I have read my share of papers,
>> but none of them seem to make this distinction.
>>
>> Kind regards,
>>
>> Joris
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
> Hi Joris,
>
> In my view the vocabulary is a superset of the actual set of the
> wordforms for which all wordform sequences (the N-permutations of
> vocabulary words, with repetition) are modeled in the N-gram LM.
>
> What limits the hypothesized transcript produced by an ASR system, is
> the intersection between the sets of:
> a. the wordforms in the pronunciation lexicon (the mapping between
> acoustic feature sequences and orthographic representations)
> b. the target words of the wordform sequences in the LM (as opposed to
> history words)
>
> The vocabulary does not matter then: it is just an optional means to
> constrain the potential richness (given the written training data) of
> an N-gram LM that you are creating. You can use a vocabulary as a
> constraint ('-limit-vocab' in 'ngram-count'), and/or use it to
> facilitate a preprocessed form of training data by means of special
> tokens that aren't really words (such as "<unk>" or a 'proper name
> class' token).
>
> So, the vocabulary may contain superfluous words. Once you realize
> that this is not an issue, you could take it further: after you have
> created and pruned an LM, you can find out which vocabulary words
> turned out to be redundant given the written training data you used
> to create that LM, and you could just as well have dropped those
> words from the vocabulary before creating your LM. Maybe that reduces
> the size of your vocabulary as much as you hope. Will this be
> worthwhile? Not for the ASR task, you see.
>
> The term OOV comes in handy as shorthand to denote words that are in
> the written training data but not in the vocabulary. It is not
> precise; you could just as well use an element-out-of-set notation
> (short and clear) in reports. Maybe you have read the article:
> "Detection of OOV Words Using Generalized Word Models and a Semantic
> Class Language Model" by Schaaf, which was a top Google result for me.
> This author confuses the pronunciation lexicon with the vocabulary.
> While you can, confusingly, call a word that was not transcribed
> correctly because, for one, it was not modeled by the pronunciation
> lexicon 'OOV', I think it is not okay to confuse the concepts
> vocabulary and pronunciation lexicon as he does.
>
> I hope this clears up any confusion?
>
>