[SRILM User List] effect of ngram -vocab and -limit-vocab on ppl calculations
Andreas Stolcke
stolcke at speech.sri.com
Mon Feb 7 23:26:02 PST 2011
zeeshan khan wrote:
> Hi all,
>
> I wanted to share my observation regarding the SRILM toolkit's
> calculation of perplexities and the effect of -vocab and -limit-vocab
> on it, and wanted to know why this happens.
>
>
> SRILM toolkit's ngram tool gives 3 different perplexities of the SAME
> text if these options are used as follows.
>
> P1: ngram -unk -map-unk '[UNKNOWN]' -order 4 -lm <LM-FILE> -ppl
> <TEXT-FILE> : gives the highest perplexity value
>
> P2: ngram -unk -map-unk '[UNKNOWN]' -vocab <VOCAB-FILE> -order 4 -lm
> <LM-FILE> -ppl <TEXT-FILE> : gives a perplexity value lower than P1 and
> greater than P3.
That's probably because your <VOCAB-FILE> contains more words than the
LM itself. That means fewer words are mapped to '[UNKNOWN]', and this
changes which probabilities are looked up in the LM. If, however, your
<VOCAB-FILE> contains only a subset of the vocabulary in the LM itself,
then there should be no change in perplexity.
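
One way to check which of these two cases applies is to compare
<VOCAB-FILE> against the words in the LM's unigram section. The following
is only a minimal sketch: it assumes an uncompressed ARPA-format LM and a
vocabulary file with one word per line, and the function names lm_words
and vocab_not_in_lm are made up for illustration.

```shell
#!/bin/sh
# Print the words listed in the unigram section of an ARPA-format LM.
# Assumption: plain-text LM whose unigram lines look like
# "logprob word [backoff-weight]".
lm_words() {
    awk '
        /^\\1-grams:/ { in_uni = 1; next }  # unigram section starts here
        /^\\/         { in_uni = 0 }        # any other section header ends it
        in_uni && NF >= 2 { print $2 }      # second field is the word
    ' "$1" | LC_ALL=C sort -u
}

# Print the words that are in the vocabulary file but have no unigram in the LM.
# Assumption: the vocabulary file has one word per line.
vocab_not_in_lm() {
    vocab=$1
    lm=$2
    tmp=$(mktemp)
    lm_words "$lm" > "$tmp"
    LC_ALL=C sort -u "$vocab" | LC_ALL=C comm -23 - "$tmp"
    rm -f "$tmp"
}

# Example call (file names are placeholders):
#   vocab_not_in_lm my.vocab my.lm
```

If this prints nothing, <VOCAB-FILE> is a subset of the LM vocabulary; any
words it does print are vocabulary entries that the LM does not contain.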
>
> P3: ngram -unk -map-unk '[UNKNOWN]' -vocab <VOCAB-FILE> -limit-vocab
> -order 4 -lm <LM-FILE> -ppl <TEXT-FILE> : gives perplexity value
> smaller than both P1 and P2.
This has the effect that only ngrams covered by the words in
<VOCAB-FILE> are read from the LM.
Presumably more words are now mapped to [UNKNOWN], but it's hard to
predict what happens to perplexity because you don't say what the
relationship between the vocabulary and the data in <TEXT-FILE> is.
The purpose of -limit-vocab is to load all and only the portions of the LM
that are needed by the input data. Therefore, to make meaningful use of
this option you need to generate the vocabulary from the <TEXT-FILE> in
this case.
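
Generating the vocabulary from <TEXT-FILE> can be sketched as follows.
This assumes whitespace-separated tokens; the helper name text_vocab, the
output file text.vocab, and the LM_FILE and TEXT_FILE environment
variables are hypothetical names used only for this illustration.

```shell
#!/bin/sh
# Print every distinct whitespace-separated token of a text file,
# one token per line.
text_vocab() {
    tr -s ' \t' '\n' < "$1" | sed '/^$/d' | LC_ALL=C sort -u
}

# Run the P3 command with a vocabulary derived from the text itself.
# Skipped unless ngram is installed and both variables are set.
if command -v ngram >/dev/null 2>&1 \
        && [ -n "${LM_FILE:-}" ] && [ -n "${TEXT_FILE:-}" ]; then
    text_vocab "$TEXT_FILE" > text.vocab
    ngram -unk -map-unk '[UNKNOWN]' -vocab text.vocab -limit-vocab \
        -order 4 -lm "$LM_FILE" -ppl "$TEXT_FILE"
fi
```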
>
> Can anyone tell me why this happens ? I thought the effect of -vocab
> and -limit-vocab options is only on memory usage.
A good way to track down the differences is to use -debug 2, capture the
output in files, and use diff to see where they differ.
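
A rough sketch of that workflow, assuming the same options as in the P1
to P3 commands; the function names ppl_debug and first_diffs and all
file names are placeholders, not part of SRILM.

```shell
#!/bin/sh
# Run one ngram configuration at -debug 2 and save its output to a file.
# Arguments: output file, LM file, text file, then any extra ngram options
# (for example: -vocab FILE -limit-vocab).
ppl_debug() {
    out=$1
    lm=$2
    text=$3
    shift 3
    ngram -unk -map-unk '[UNKNOWN]' "$@" -order 4 -lm "$lm" \
        -debug 2 -ppl "$text" > "$out" 2>&1
}

# Show the first lines of the diff between two saved outputs
# (40 lines unless a third argument gives another limit).
first_diffs() {
    diff "$1" "$2" | head -n "${3:-40}"
}

# Example calls (file names are placeholders):
#   ppl_debug p1.out my.lm my.txt
#   ppl_debug p2.out my.lm my.txt -vocab my.vocab
#   ppl_debug p3.out my.lm my.txt -vocab my.vocab -limit-vocab
#   first_diffs p1.out p2.out
#   first_diffs p2.out p3.out
```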
Andreas
>
>
> Just for information, the VOCAB files are generated from lattice files
> generated during a recognition process.
>
>
> Thanks and Regards,
>
>
> Zeeshan.
> ------------------------------------------------------------------------
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user