[SRILM User List] effect of ngram -vocab and -limit-vocab on ppl calculations

Andreas Stolcke stolcke at speech.sri.com
Mon Feb 7 23:26:02 PST 2011


zeeshan khan wrote:
> Hi all, 
>
> I wanted to share my observation regarding the SRILM toolkit's 
> calculation of perplexities and the effect of  -vocab and -limit-vocab 
> on it, and wanted to know why this happens.
>
>
> SRILM toolkit's ngram tool gives 3 different perplexities of the SAME 
> text if these options are used as follows. 
>
> P1: ngram -unk -map-unk '[UNKNOWN]'  -order 4 -lm <LM-FILE> -ppl 
> <TEXT-FILE> : gives the highest perplexity value
>
> P2: ngram -unk -map-unk '[UNKNOWN]' -vocab <VOCAB-FILE> -order 4 -lm 
> <LM-FILE> -ppl <TEXT-FILE> : gives a perplexity value lower than P1 and 
> higher than P3.
That's probably because your <VOCAB-FILE> contains words that are not in
the LM itself. That means fewer words are mapped to '[UNKNOWN]', and this
changes which probabilities are looked up in the LM. If, however, your
<VOCAB-FILE> contains only a subset of the vocabulary in the LM itself,
then there should be no change in perplexity.
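
To see which words of the text are affected, one can compare the
out-of-vocabulary (OOV) word lists with and without the extra vocabulary.
The following is only a rough sketch using standard Unix tools, not SRILM
itself. The file names (text.txt, lm.vocab, extra.vocab) and their contents
are made up for illustration, and it assumes that with -vocab the effective
vocabulary is the union of the LM words and the <VOCAB-FILE> words.

```shell
export LC_ALL=C   # keep sort(1) and comm(1) ordering consistent

# Hypothetical sample files, for illustration only.
workdir=$(mktemp -d)
printf 'the cat sat\nthe dog barked loudly\n' > "$workdir/text.txt"
printf 'the\ncat\ndog\n' > "$workdir/lm.vocab"          # words known to the LM
printf 'the\ncat\ndog\nsat\n' > "$workdir/extra.vocab"  # stands in for <VOCAB-FILE>

# One word per line, sorted and de-duplicated, as comm(1) requires.
tr -s ' \t' '\n' < "$workdir/text.txt" | sed '/^$/d' | sort -u > "$workdir/text.words"
sort -u "$workdir/lm.vocab" > "$workdir/lm.sorted"

# Assumed effective vocabulary with -vocab: LM words plus <VOCAB-FILE> words.
cat "$workdir/lm.vocab" "$workdir/extra.vocab" | sort -u > "$workdir/union.vocab"

# Words of the text that fall outside each vocabulary.
oov_lm=$(comm -23 "$workdir/text.words" "$workdir/lm.sorted" | paste -sd ' ' -)
oov_union=$(comm -23 "$workdir/text.words" "$workdir/union.vocab" | paste -sd ' ' -)
echo "OOV with LM vocabulary only:    $oov_lm"
echo "OOV with LM + <VOCAB-FILE> words: $oov_union"
```

In this toy case "sat" stops being OOV once the extra vocabulary is added,
so it would no longer be mapped to '[UNKNOWN]'.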

>
> P3: ngram -unk -map-unk '[UNKNOWN]' -vocab <VOCAB-FILE> -limit-vocab 
> -order 4 -lm <LM-FILE> -ppl <TEXT-FILE> : gives perplexity value 
> smaller than both P1 and P2.
This has the effect that only ngrams covered by the words in 
<VOCAB-FILE> are read from the LM.
Presumably more words are now mapped to [UNKNOWN], but it's hard to 
predict what happens to perplexity because you don't say what the 
relationship between the vocabulary and the data in <TEXT-FILE> is.
The purpose of -limit-vocab is to load all and only the portions of the LM 
that are needed by the input data.  Therefore, to make meaningful use of 
this option you need to generate the vocabulary from the <TEXT-FILE> in 
this case.
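
A minimal sketch of generating such a vocabulary from the text with
standard Unix tools, one word per line. The sample sentence and the
temporary files are made up for illustration, and whitespace-separated
tokens are assumed.

```shell
export LC_ALL=C

text_file=$(mktemp)    # stands in for <TEXT-FILE>
vocab_file=$(mktemp)   # stands in for <VOCAB-FILE>
printf 'a rose is a rose\nis a rose\n' > "$text_file"   # toy data

# Split on whitespace, drop empty lines, keep each distinct word once.
tr -s ' \t' '\n' < "$text_file" | sed '/^$/d' | sort -u > "$vocab_file"
cat "$vocab_file"

# The resulting file could then be passed to the P3 command, e.g.:
#   ngram -unk -map-unk '[UNKNOWN]' -vocab "$vocab_file" -limit-vocab \
#         -order 4 -lm <LM-FILE> -ppl "$text_file"
```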
>
> Can anyone tell me why this happens? I thought the -vocab and 
> -limit-vocab options only affected memory usage.
A good way to track down the differences is to use -debug 2, capture the 
output in files, and use diff to see where they differ.
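
For example, something along these lines, reusing the P1/P2/P3 commands
from the question. This is a sketch only: the LM, TEXT and VOCAB defaults
are placeholder file names, and the run_ppl helper is a made-up wrapper,
not part of SRILM. It does nothing if ngram is not on the PATH.

```shell
# Placeholders: point these at your own files.
LM=${LM:-lm.arpa}
TEXT=${TEXT:-test.txt}
VOCAB=${VOCAB:-words.vocab}
outdir=$(mktemp -d)

# Hypothetical helper: run_ppl OUTPUT-FILE [extra ngram options...]
run_ppl() {
  out=$1; shift
  ngram -unk -map-unk '[UNKNOWN]' "$@" -order 4 -lm "$LM" \
        -debug 2 -ppl "$TEXT" > "$out"
}

if command -v ngram >/dev/null 2>&1; then
  run_ppl "$outdir/p1.debug" || true
  run_ppl "$outdir/p2.debug" -vocab "$VOCAB" || true
  run_ppl "$outdir/p3.debug" -vocab "$VOCAB" -limit-vocab || true
  # Show where the per-word output first diverges between runs.
  diff "$outdir/p1.debug" "$outdir/p2.debug" | head -n 40 || true
  diff "$outdir/p2.debug" "$outdir/p3.debug" | head -n 40 || true
  ppl_diff_state=ran
else
  echo "ngram not found on PATH; skipping comparison" >&2
  ppl_diff_state=skipped
fi
```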

Andreas

>
>
> Just for information, the VOCAB files are generated from lattice files 
> generated during a recognition process.
>
>
> Thanks and Regards,
>
>
> Zeeshan.



