[SRILM User List] ngram-count: -skip in combination with -unk
Andreas Stolcke
stolcke at icsi.berkeley.edu
Mon Jun 10 15:37:30 PDT 2013
On 6/10/2013 2:55 PM, Sander Maijers wrote:
> What is the interaction between the "-unk" and "-skip" parameters to
> 'ngram-count' when creating an LM given a word list that fully covers
> the training words?
>
> According to srilm-faq.7, the precise interaction in terms of backoff
> strategy when a test word sequence is looked up that has no
> corresponding N-gram in the LM depends on the particular backoff scheme.
The effect of -unk is very specific: it allows including ngrams
involving the <unk> word in the LM.
Without it, words not contained in the vocabulary are still mapped to
<unk> but then discarded from the model.
ngram-count -unk is usually used together with -vocab.
Otherwise all words are implicitly added to the vocabulary and you
wouldn't see any <unk> occurrences. The same is true if your word list
contains all the words in your training data: you won't see any ngrams
containing the <unk> word
(unless the input data already contains <unk> tokens, which is another
way to structure your data processing).
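To make the above concrete, here is a toy Python sketch of the behavior
described. It is not SRILM code and leaves out smoothing and backoff
entirely; the names map_to_vocab, count_ngrams and keep_unk are made up
for this illustration, with keep_unk standing in for the effect of -unk.

```python
from collections import Counter

UNK = "<unk>"

def map_to_vocab(tokens, vocab):
    """Map every word outside the vocabulary to <unk> (hypothetical helper)."""
    return [w if w in vocab else UNK for w in tokens]

def count_ngrams(tokens, order, keep_unk):
    """Count all ngrams up to `order`.

    keep_unk mimics the effect of -unk as described above: when False,
    ngrams involving <unk> are dropped from the counts.
    """
    counts = Counter()
    for n in range(1, order + 1):
        for i in range(len(tokens) - n + 1):
            gram = tuple(tokens[i:i + n])
            if not keep_unk and UNK in gram:
                continue
            counts[gram] += 1
    return counts

if __name__ == "__main__":
    text = "a b c a".split()

    # Word list that misses "c": <unk> ngrams appear only with keep_unk=True.
    mapped = map_to_vocab(text, {"a", "b"})
    print(count_ngrams(mapped, 2, keep_unk=True))
    print(count_ngrams(mapped, 2, keep_unk=False))

    # Word list that covers the training data: both settings give equal counts.
    covered = map_to_vocab(text, {"a", "b", "c"})
    print(count_ngrams(covered, 2, True) == count_ngrams(covered, 2, False))
```

With a word list that covers every training word, the two settings
produce identical counts, which is the case asked about.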
The way regular backoff and the "skip" ngram operate is really
orthogonal to the above. Words are either mapped to themselves or to
<unk>, but once that is done the model (in backing off, mixing regular
and skip ngram estimates) does nothing special with <unk>.
If this is not clear, maybe you could give a specific example and we
can walk through it.
Andreas