-unk flag

Andreas Stolcke stolcke at speech.sri.com
Mon Sep 6 13:54:16 PDT 2004


In message <4138A990.5060500 at csail.mit.edu> you wrote:
> Could someone please tell me what the -unk flag will do to the probability
> model? It seems that, with the -unk flag, the language model will give a very
> good probability to unknown words, even when the training sentences don't
> contain any unknown words. In fact, I found that the probability for a sentence
> in the training data is inferior to that of a sentence composed entirely of
> unknown words (the number of words is the same in the two sentences). This is
> quite unexpected.

ngram-count -unk builds an LM that has <unk> as a word type and assigns
it non-zero probability (the default is not to include <unk> in the LM).
All words in the training data that are not listed in the -vocab file are
mapped to <unk>.  This is what is commonly known as an "open-vocabulary" LM.
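For concreteness, a typical open-vocabulary build might look like the
following sketch (the file names vocab.txt, train.txt, and open_vocab.lm
are placeholders, not names from the original question):

```shell
# Build an open-vocabulary trigram LM.  Training words that do not
# appear in vocab.txt are mapped to <unk>, which ends up with
# non-zero probability mass in the resulting model.
ngram-count -order 3 -unk \
    -vocab vocab.txt \
    -text train.txt \
    -lm open_vocab.lm
```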

ngram -unk should be used to evaluate an LM that contains <unk>.
(A warning will be issued if -unk is not specified and the LM contains
a non-zero probability for <unk>.)
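A matching evaluation step might look like this (again with placeholder
file names, assuming the LM was built with -unk as above):

```shell
# Score a test set with the same -unk flag, so out-of-vocabulary test
# tokens are scored as <unk> rather than skipped; the reported
# perplexity then covers every token in test.txt.
ngram -order 3 -unk -lm open_vocab.lm -ppl test.txt
```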

The behavior you describe is certainly not expected.  But to figure out
why it happens one would have to look at the data and the exact command
invocations you are using.

--Andreas 




More information about the SRILM-User mailing list