SRILM 1.3.2

Tue Nov 5 16:30:04 PST 2002

In message <3DC8153D.E75FFED2 at crim.ca> you wrote:
> 
> Hi,
> 
> I ran many tests with the "ngram" program and its -prune option to find
> the best-suited language model for a given text, and I may have
> discovered a bug in the OOV count that ngram displays.
> 
> With a command like:
> jarjar jfbeaumo/mlf> ngram -order 3 -vocab vocab20k.txt -unk -lm transtalk10.arpa -ppl test.txt
> file test.txt: 635 sentences, 9448 words, 0 OOVs
> 0 zeroprobs, logprob= -17926.9 ppl= 59.9706 ppl1= 78.9647
> 
> I always end up with 0 OOVs. The language model does contain the <unk>
> token. I expected that with a sufficiently large value for -prune I would
> begin to get OOV words, but the count stays at 0. If I specify an empty
> vocabulary file, there are again 0 OOVs, which I suppose isn't correct.
> Maybe ngram is taking its vocabulary from the LM, but then there would be
> no use for the -vocab switch.
> 
> Can you help me? Did I miss something?
> 
> Best regards,
> 
> JF
> --
> Jean-François Beaumont - Agent de recherche (jfbeaumont at crim.ca)
> CRIM - 550, rue Sherbrooke Ouest Bureau 100 (www.crim.ca)
> Montréal (Québec) H3A 1B9  Tél.: 514.840-1235 #3625

Dear JF,

It is actually a feature (not a bug) that ngram -unk counts OOVs as regular
words. With -unk, every word outside the vocabulary is mapped to <unk> and
scored with that token's probability. Such words would only be counted as
OOVs in the ppl output if the LM did not contain the <unk> token, or if
<unk> had probability 0.
Of course, whether this is what you expect is debatable.
You can get the OOV count you want by grepping the ngram -ppl output
(run with -debug 2) for "p( <unk> | ".
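
As a rough sketch of that grep, assuming the per-word probability lines
come from running ngram -ppl with -debug 2: the helper name
count_unk_tokens below is made up for illustration and is not part of
SRILM. It simply counts the input lines that contain "p( <unk> | ".

```shell
# Hypothetical helper (not part of SRILM): count the word tokens that
# ngram scored as <unk>. It reads per-word ngram -ppl output on stdin
# and counts the lines containing the fixed string "p( <unk> | ".
count_unk_tokens() {
    # -F: match the pattern literally; -c: print the number of matching lines.
    # grep exits non-zero when nothing matches, so "|| true" keeps the
    # helper usable under "set -e" while still printing 0.
    grep -c -F 'p( <unk> | ' || true
}

# Intended use, reusing the file names from the question above
# (assumes per-word output is enabled with -debug 2):
#   ngram -order 3 -vocab vocab20k.txt -unk -lm transtalk10.arpa \
#       -ppl test.txt -debug 2 | count_unk_tokens
```

The number printed is the count of test tokens that were mapped to <unk>,
that is, the OOV count relative to the vocabulary in use.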

--Andreas