[SRILM User List] ngram-count: -skip in combination with -unk
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Jun 11 15:05:17 PDT 2013
On 6/11/2013 4:07 PM, Sander Maijers wrote:
> On 11-06-13 00:37, Andreas Stolcke wrote:
>> On 6/10/2013 2:55 PM, Sander Maijers wrote:
>>> What is the interaction between the "-unk" and "-skip" parameters to
>>> 'ngram-count' when creating an LM given a word list that fully covers
>>> the training words?
>>>
>>> According to srilm-faq.7, the precise interaction in terms of backoff
>>> strategy when a test word sequence is looked up that has no
>>> corresponding N-gram in the LM depends on the particular backoff
>>> scheme.
>>
>> The effect of -unk is very specific: it allows including ngrams
>> involving the <unk> word in the LM.
>> Without it, words not contained in the vocabulary are still mapped to
>> <unk> but then discarded from the model.
>>
>> The -unk option to ngram-count is usually used when -vocab is also specified.
>> Otherwise all words are implicitly added to the vocabulary and you
>> wouldn't see any <unk> occurrences. The same is true if your word list
>> contains all the words in your training data: you won't see any ngrams
>> containing the <unk> word
>> (unless the input data already contains them, which is another way to
>> structure your data processing).
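>>
>> As a concrete sketch (file names here are placeholders), such a
>> setup might look like:
>>
>>     ngram-count -order 3 -text train.txt -vocab wordlist.txt \
>>         -unk -lm model.lm
>>
>> With -vocab and -unk, training words outside wordlist.txt are
>> mapped to <unk> and N-grams containing <unk> are kept in model.lm;
>> dropping -unk keeps the mapping but discards those N-grams.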
>>
>> The way regular backoff and the "skip" ngram operate is really
>> orthogonal to the above. Words are either mapped to themselves or to
>> <unk> , but once that is done the model (in backing off, mixing regular
>> and skip ngram estimates) does nothing special with <unk>.
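>>
>> The same holds at evaluation time: passing -unk to ngram scores
>> OOV test words as <unk> rather than skipping them, for example
>> (again with placeholder file names):
>>
>>     ngram -order 3 -lm model.lm -unk -ppl test.txt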
>>
>> If not clear, maybe you could give a specific example and we can walk
>> through it.
>>
>> Andreas
>
> Suppose occurrences of "a b c" have been added to the count for
> "a b <unk>" in a certain LM. Suppose that the 3-gram for "a b c" is looked up. It
> would match "a b <unk>" and not back off (no need, because a matching
> N-gram was found). Conversely, suppose "a b <unk>" is not in the LM.
> Then backing off would be attempted.
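>
> One way to verify this is to look at per-word scores with ngram's
> debugging output (assuming test.txt contains the sequence in
> question):
>
>     ngram -lm model.lm -unk -debug 2 -ppl test.txt
>
> At -debug 2, ngram reports for each word which N-gram order was
> used to score it, so you can see whether "a b c" was matched by the
> trigram "a b <unk>" or scored by backing off to a shorter context.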
>
> I think that my word list by itself fully covers the training data.
> Then it wouldn't matter whether I created the LM with or without the
> '-unk' parameter. But if instead there are a few OOV words in the
> training data, then specifying '-unk' and '-skip' means that in cases
> like my previous "a b c" example no backoff would be performed. In
> fact, all test word sequences "a b (OOV)" will get the same
> probability estimate, the probability estimate for "a b <unk>".
>
> Is the above reasoning entirely correct, or not? This reasoning is
> what made me ask this question. If this is the case, then it would be
> better not to use '-unk', as my goal is to compare two language
> models, one with backoff and one without.
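>
> As a minimal sketch, the two models could be built from the same
> data and word list, differing only in the -skip flag (file names
> are placeholders, and any -skip estimation options are left at
> their defaults):
>
>     ngram-count -order 3 -text train.txt -vocab wordlist.txt \
>         -lm backoff.lm
>     ngram-count -order 3 -text train.txt -vocab wordlist.txt \
>         -skip -lm skip.lm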
Your reasoning is correct.
Andreas