[SRILM User List] ngram-count: -skip in combination with -unk
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Jun 11 15:05:17 PDT 2013
On 6/11/2013 4:07 PM, Sander Maijers wrote:
> On 11-06-13 00:37, Andreas Stolcke wrote:
>> On 6/10/2013 2:55 PM, Sander Maijers wrote:
>>> What is the interaction between the "-unk" and "-skip" parameters to
>>> 'ngram-count' when creating an LM given a word list that fully covers
>>> the training words?
>>>
>>> According to srilm-faq.7, the precise interaction in terms of backoff
>>> strategy when a test word sequence is looked up that has no
>>> corresponding N-gram in the LM depends on the particular backoff
>>> scheme.
>>
>> The effect of -unk is very specific: it allows including ngrams
>> involving the <unk> word in the LM.
>> Without it, words not contained in the vocabulary are still mapped to
>> <unk> but then discarded from the model.
>>
>> The -unk option to ngram-count is usually used when -vocab is also specified.
>> Otherwise all words are implicitly added to the vocabulary and you
>> wouldn't see any <unk> occurrences. The same is true if your word list
>> contains all the words in your training data: you won't see any ngrams
>> containing the <unk> word
>> (unless the input data already contains them, which is another way to
>> structure your data processing).
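>>
>> As a concrete sketch (file names here are placeholders), such a
>> setup might look like:
>>
>>     ngram-count -order 3 -text train.txt -vocab wordlist.txt \
>>         -unk -lm model.lm
>>
>> With -vocab and -unk, training words outside wordlist.txt are
>> mapped to <unk> and N-grams containing <unk> are kept in model.lm;
>> dropping -unk keeps the mapping but discards those N-grams.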
>>
>> The way regular backoff and the "skip" ngram operate is really
>> orthogonal to the above. Words are either mapped to themselves or to
>> <unk> , but once that is done the model (in backing off, mixing regular
>> and skip ngram estimates) does nothing special with <unk>.
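>>
>> The same holds at evaluation time: passing -unk to ngram scores
>> OOV test words as <unk> rather than skipping them, for example
>> (again with placeholder file names):
>>
>>     ngram -order 3 -lm model.lm -unk -ppl test.txt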
>>
>> If not clear, maybe you could give a specific example and we can walk
>> through it.
>>
>> Andreas
>
> Suppose occurrences of "a b c" have been added to the count for
> "a b <unk>" in a certain LM. Suppose that the 3-gram for "a b c" is looked up. It
> would match "a b <unk>" and not back off (no need, because a matching
> N-gram was found). Conversely, suppose "a b <unk>" is not in the LM.
> Then backing off would be attempted.
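>
> One way to verify this is to look at per-word scores with ngram's
> debugging output (assuming test.txt contains the sequence in
> question):
>
>     ngram -lm model.lm -unk -debug 2 -ppl test.txt
>
> At -debug 2, ngram reports for each word which N-gram order was
> used to score it, so you can see whether "a b c" was matched by the
> trigram "a b <unk>" or scored by backing off to a shorter context.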
>
> I think that my word list by itself fully covers the training data.
> Then it wouldn't matter whether I created the LM with or without the
> '-unk' parameter. But if instead there are a few OOV words in the
> training data, then specifying '-unk' and '-skip' means that in cases
> like my previous "a b c" example no backoff would be performed. In
> fact, all test word sequences "a b (OOV)" will get the same
> probability estimate, the probability estimate for "a b <unk>".
>
> Is the above reasoning entirely correct, or not? This reasoning is
> what made me ask this question. If this is the case, then it would be
> better not to use '-unk', as my goal is to compare two language
> models, one with backoff and one without.
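>
> As a minimal sketch, the two models could be built from the same
> data and word list, differing only in the -skip flag (file names
> are placeholders, and any -skip estimation options are left at
> their defaults):
>
>     ngram-count -order 3 -text train.txt -vocab wordlist.txt \
>         -lm backoff.lm
>     ngram-count -order 3 -text train.txt -vocab wordlist.txt \
>         -skip -lm skip.lm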
Your reasoning is correct.
Andreas