[SRILM User List] ngram-count: -skip in combination with -unk

Mon Jun 10 15:37:30 PDT 2013

On 6/10/2013 2:55 PM, Sander Maijers wrote:
> What is the interaction between the "-unk" and "-skip" parameters to 
> 'ngram-count' when creating an LM given a word list that fully covers 
> the training words?
>
> According to srilm-faq.7, the precise interaction in terms of backoff 
> strategy when a test word sequence is looked up that has no 
> corresponding N-gram in the LM depends on the particular backoff scheme.

The effect of -unk is very specific: it allows including ngrams
involving the <unk> word in the LM.
Without it, words not contained in the vocabulary are still mapped to
<unk> but then discarded from the model.
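
To make the mapping step concrete, here is a small Python sketch. It is
a toy model of the behaviour described above, not SRILM's actual
implementation: the function names and the keep_unk flag are made up for
illustration, and sentence boundary tags and discounting are ignored.

```python
from collections import Counter

UNK = "<unk>"

def map_to_vocab(words, vocab):
    """Map every word outside the vocabulary to <unk>."""
    return [w if w in vocab else UNK for w in words]

def count_ngrams(words, order, keep_unk):
    """Count ngrams; keep_unk loosely plays the role of -unk (assumption)."""
    counts = Counter()
    for i in range(len(words) - order + 1):
        ngram = tuple(words[i:i + order])
        # Without keep_unk, ngrams involving <unk> are dropped.
        if not keep_unk and UNK in ngram:
            continue
        counts[ngram] += 1
    return counts

vocab = {"the", "cat", "sat"}
sentence = "the cat sat on the mat".split()

# "on" and "mat" are not in the vocabulary, so both become <unk>.
mapped = map_to_vocab(sentence, vocab)

with_unk = count_ngrams(mapped, 2, keep_unk=True)
without_unk = count_ngrams(mapped, 2, keep_unk=False)
```

In both cases the mapping to <unk> happens; the only difference is
whether ngrams involving <unk> survive into the counts.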

ngram-count -unk is usually used when -vocab is also specified.
Otherwise all words are implicitly added to the vocabulary and you
wouldn't see any <unk> occurrences.  The same is true if your word list
contains all the words in your training data: you won't see any ngrams
containing the <unk> word (unless the input data already contains them,
which is another way to structure your data processing).
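
A toy Python check of the two cases above (full coverage, and input
that already contains <unk>). The helper name map_to_vocab is
hypothetical and only mimics the word-to-<unk> mapping; it is not part
of SRILM.

```python
UNK = "<unk>"

def map_to_vocab(words, vocab):
    """Map every word outside the vocabulary to <unk>."""
    return [w if w in vocab else UNK for w in words]

training = "the cat sat on the mat".split()

# Word list that fully covers the training data: nothing maps to <unk>.
full_vocab = set(training)
mapped_full = map_to_vocab(training, full_vocab)
unk_in_full = UNK in mapped_full

# Input that already contains <unk>: the token passes straight through,
# so <unk> ngrams can appear even with a fully covering word list.
pre_tagged = "the cat sat on the <unk>".split()
mapped_tagged = map_to_vocab(pre_tagged, full_vocab)
unk_in_tagged = UNK in mapped_tagged
```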

The way regular backoff and the "skip" ngram operate is really
orthogonal to the above.  Words are either mapped to themselves or to
<unk>, but once that is done the model (in backing off, mixing regular
and skip ngram estimates) does nothing special with <unk>.
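
As a rough illustration of that orthogonality, the following Python
sketch does a standard bigram backoff lookup in which <unk> is just
another key in the tables. All names and log-probability values are
invented for the example, and it covers only regular backoff, not the
skip ngram mixing.

```python
UNK = "<unk>"

def map_word(word, vocab):
    """Map a word to itself if known, otherwise to <unk>."""
    return word if word in vocab else UNK

def bigram_logprob(prev, word, vocab, bigrams, unigrams, backoff_weights):
    """Backoff lookup; note there is no special case for <unk>."""
    prev, word = map_word(prev, vocab), map_word(word, vocab)
    if (prev, word) in bigrams:
        return bigrams[(prev, word)]
    # Bigram not in the model: backoff weight of the history plus unigram.
    return backoff_weights.get(prev, 0.0) + unigrams[word]

# Made-up log probabilities, chosen only to keep the arithmetic simple.
vocab = {"the", "cat", UNK}
unigrams = {"the": -0.5, "cat": -0.75, UNK: -1.0}
bigrams = {("the", "cat"): -0.25, ("the", UNK): -0.5}
backoff_weights = {"the": -0.25, "cat": -0.125, UNK: -0.5}

# "mat" is unknown, so this looks up the bigram (the, <unk>) directly.
seen_unk = bigram_logprob("the", "mat", vocab, bigrams, unigrams,
                          backoff_weights)

# History "mat" maps to <unk>; (<unk>, cat) is absent, so we back off.
backed_off = bigram_logprob("mat", "cat", vocab, bigrams, unigrams,
                            backoff_weights)
```

The lookup code never tests for <unk> explicitly; the only place it
matters is the mapping step at the top.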

If this is not clear, maybe you could give a specific example and we
can walk through it.

Andreas