Naive question about unknown words

Tue Oct 11 09:19:19 PDT 2005

Arnaud

When you created the language model, you passed -unk to ngram-count, 
which adds an unknown-word token (<unk>, a placeholder for 
out-of-vocabulary items) with a non-zero probability.  Since you didn't 
also invoke ngram with the -unk option, it treats the LM as 
closed-vocabulary and warns that the LM nevertheless has a non-zero 
probability for <unk>.  You can avoid the warning by specifying -unk for 
ngram as well or, alternatively, by building a closed-vocabulary LM to 
start with (i.e. ngram-count without -unk).  Although you state that you 
want a non-zero weight for unknown unigrams, I would recommend that, if 
at all possible, you predetermine the domain vocabulary and build a 
closed-vocabulary LM.
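
Spelled out as commands, the two options look roughly like the sketch 
below.  GT, learningdb.txt and sentence.txt are the names from your 
message; domain.vocab is a hypothetical word list (one word per line) 
standing in for the predetermined vocabulary, and the run helper and 
DRY_RUN variable are only there for illustration: by default the helper 
prints each command, and it executes them only when DRY_RUN=0.

```shell
# Illustration only: print each command, and execute it when DRY_RUN=0.
run() {
    printf '+ %s\n' "$*"
    if [ "${DRY_RUN:-1}" = 0 ]; then "$@"; fi
}

# Option 1: keep the open-vocabulary LM and pass -unk to ngram as well,
# so that training and evaluation agree on how <unk> is treated.
run ngram-count -text learningdb.txt -lm GT -unk
run ngram -lm GT -unk -ppl sentence.txt

# Option 2: build a closed-vocabulary LM (no -unk) over a fixed word list.
# domain.vocab is a hypothetical file name, not something from the thread.
run ngram-count -text learningdb.txt -vocab domain.vocab -lm GT
run ngram -lm GT -ppl sentence.txt
```

With option 1 the unknown words keep the weight you asked for.  With 
option 2 the warning goes away because the LM no longer carries <unk> as 
a regular word, so words outside the vocabulary are treated as OOVs 
rather than scored.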

Regards

&

gaudinat wrote:
> Sorry for this naive question:
> 
> I create my LM with this command:
> ngram-count -text learningdb.txt -lm GT -unk
> 
> I evaluate a sentence with the following command:
> ngram -lm GT -ppl sentence.txt
> 
> I obtain coherent results, but I also get the following warning message:
> "warning: non-zero probability for <unk> in closed-vocabulary LM"
> 
> Can anyone give me some information about this warning and how to avoid it?
> Of course I need to give a weight for the unknown words.
> 
> Thanks in advance,
> 
> Arnaud.