Naive question about unknown words
Anand Venkataraman (Roaming)
anand at speech.sri.com
Tue Oct 11 09:19:19 PDT 2005
Arnaud,
When you created the language model, you passed -unk to ngram-count, which
builds an open-vocabulary LM: the unknown-word token <unk> (a placeholder
for out-of-vocabulary items) gets a non-zero probability. Since you did not
also invoke ngram with the -unk option, it treats the LM as closed-vocabulary
and warns that <unk> nevertheless has a non-zero probability. You can avoid
the warning by specifying -unk for ngram as well, or alternatively by
building a closed-vocabulary LM to start with (i.e. ngram-count without
-unk). Although you state that you want a non-zero weight for unknown
unigrams, I would recommend that, if at all possible, you predetermine the
domain vocabulary and build a closed-vocabulary LM.
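
The two options above could be sketched roughly as follows. This is only an
illustration under some assumptions: the run helper is a hypothetical
dry-run wrapper (not part of SRILM) that prints each command and executes it
only if the binary is on PATH, and the file names domain.vocab and GT.closed
are made up for the example. The -vocab option of ngram-count is used here
to supply the predetermined vocabulary.

```shell
# Hypothetical dry-run helper (not part of SRILM): print the command,
# then execute it only if the named binary is actually on PATH.
run() {
    echo "+ $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@"
    fi
}

# Option 1: keep the open-vocabulary LM and pass -unk to ngram as well,
# so that training and evaluation agree on how <unk> is treated.
run ngram-count -text learningdb.txt -lm GT -unk
run ngram -lm GT -unk -ppl sentence.txt

# Option 2: build a closed-vocabulary LM (no -unk at either step).
# domain.vocab is an assumed word list, one word per line;
# GT.closed is an assumed output name so that GT is not overwritten.
run ngram-count -text learningdb.txt -vocab domain.vocab -lm GT.closed
run ngram -lm GT.closed -ppl sentence.txt
```

With option 1 the warning should no longer appear, since ngram is told that
the LM is open-vocabulary. With option 2 there is no <unk> probability to
warn about, but out-of-vocabulary words in sentence.txt are then reported as
OOVs by -ppl rather than being scored.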
Regards
&
gaudinat wrote:
> Sorry for this naive question:
>
> I create my LM with this command:
> ngram-count -text learningdb.txt -lm GT -unk
>
> I evaluate a sentence with the following command:
> ngram -lm GT -ppl sentence.txt
>
> I obtain coherent results but I get also the following warning message:
> "warning: non-zero probability for <unk> in closed-vocabulary LM"
>
> Can anyone give me some information about this warning and how to avoid it?
> Of course I need to give a weight for the unknown words.
>
> Thanks in advance,
>
> Arnaud.