stolcke at speech.sri.com
Tue Nov 5 16:30:04 PST 2002
In message <3DC8153D.E75FFED2 at crim.ca> you wrote:
> I ran many tests with the "ngram" program and its -prune option to find the
> best-suited language model for a given text, and I may have discovered a
> bug in the OOV count that ngram displays.
> With a command like:
> jarjar jfbeaumo/mlf> ngram -order 3 -vocab vocab20k.txt -unk -lm
> transtalk10.arpa -ppl test.txt
> file test.txt: 635 sentences, 9448 words, 0 OOVs
> 0 zeroprobs, logprob= -17926.9 ppl= 59.9706 ppl1= 78.9647
> I always end up with 0 OOVs. The language model does contain the <unk>
> token. I expected that with a sufficiently large value for -prune I would
> start to get OOV words, but the count stays at 0. If I specify an empty
> vocabulary file there are again 0 OOVs, which I suppose isn't correct. Maybe
> ngram is taking its vocabulary from the LM, but then there would be no
> use for the -vocab switch.
> Can you help me? Did I miss something?
> Best regards,
> Jean-François Beaumont - Agent de recherche (jfbeaumont at crim.ca)
> CRIM - 550, rue Sherbrooke Ouest Bureau 100 (www.crim.ca)
> Montréal (Québec) H3A 1B9 Tél.: 514.840-1235 #3625
It is actually a feature (not a bug) that ngram -unk counts OOVs as regular
words. They would only be counted as OOVs in the ppl output if the
LM did not contain the <unk> token, or if <unk> had probability 0.
Of course, whether this is what you expect is debatable.
You can get the OOV count you want by grepping the ngram -ppl output at
-debug 2 for "p( <unk> | ".
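
A minimal sketch of that grep, assuming the per-word lines are produced by
running ngram -ppl with -debug 2 and that each scored word appears on its own
line containing "p( word | context )". The helper name count_unk is
hypothetical, and the file names in the usage comment are simply the ones
from the command quoted above.

```shell
# count_unk (hypothetical helper): read ngram per-word output on stdin and
# print the number of lines on which <unk> is the predicted word.
# -F matches the pattern as a fixed string; -c prints the match count.
count_unk() {
    grep -cF 'p( <unk> | '
}

# Usage (requires SRILM's ngram; left commented out so the sketch runs
# without it):
# ngram -order 3 -vocab vocab20k.txt -unk -lm transtalk10.arpa \
#       -ppl test.txt -debug 2 | count_unk
```

Lines where <unk> appears only in the context (to the right of the bar) are
not counted, since the pattern requires <unk> directly after "p( ".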