[SRILM User List] reproduce Penn Treebank KN5 results
Joris Pelemans
Joris.Pelemans at esat.kuleuven.be
Thu Jul 10 01:43:57 PDT 2014
Hi Siva,
Thanks a lot! With these arguments the perplexity is very close to the
reported 141.2 (though still not exactly the same):
<jpeleman at spchcl23:~/exp/025> ngram-count -order 5 -text
data/penn/ptb.train.txt -lm models/ptb.train_5-gram_kn.arpa7 -kndiscount
-interpolate -unk -gt3min 1 -gt4min 1
<jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt -lm
models/ptb.train_5-gram_kn.arpa7 -order 5 -unk
file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 0 OOVs
0 zeroprobs, logprob= -177278 ppl= *141.464* ppl1= 179.251
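For reference, ngram's ppl and ppl1 can be recomputed from the logprob line above. This is a quick sanity check, assuming SRILM's usual definitions (ppl normalizes over words plus sentence-end tokens, ppl1 over words only); the tiny discrepancy against 141.464 comes from logprob being printed rounded:

```python
import math

# Figures copied from the ngram -ppl output above.
sentences, words, oovs, zeroprobs = 3761, 78669, 0, 0
logprob = -177278.0  # base-10 log probability (rounded in the output)

# ppl counts the </s> tokens (one per sentence); ppl1 does not.
denom = words - oovs - zeroprobs + sentences
ppl = 10 ** (-logprob / denom)
ppl1 = 10 ** (-logprob / (words - oovs - zeroprobs))
print(f"ppl = {ppl:.3f}, ppl1 = {ppl1:.3f}")
```

This reproduces roughly 141.47 and 179.25, matching the reported 141.464 / 179.251 up to rounding of the printed logprob.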
I wonder about the value of experiments that include <unk> in the
perplexity calculation. Doesn't it make the problem considerably easier
(predicting one huge class is not hard - imagine mapping all words to
<unk>) and thus yield misleadingly low perplexities?
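The effect is easy to see with a toy unigram model (a hedged sketch, not SRILM's actual computation): merging several rare words into a single <unk> class pools their probability mass, so each <unk> token in the test set receives a much higher probability and perplexity drops accordingly.

```python
import math

# Toy training corpus: 90 tokens of "the" plus 10 distinct rare words (1 each).
counts_full = {"the": 90}
counts_full.update({f"rare{i}": 1 for i in range(10)})
# Open-vocab variant: the 10 rare words are all mapped to a single <unk>.
counts_unk = {"the": 90, "<unk>": 10}

def ppl(counts, test):
    """Unigram (MLE) perplexity of a test token sequence."""
    n = sum(counts.values())
    logp = sum(math.log(counts[w] / n) for w in test)
    return math.exp(-logp / len(test))

test_full = ["rare0", "rare1"]  # rare words scored as themselves: p = 1/100 each
test_unk = ["<unk>", "<unk>"]   # same tokens mapped to <unk>:     p = 10/100 each
print(ppl(counts_full, test_full))  # 100.0
print(ppl(counts_unk, test_unk))    # 10.0
```

A tenfold drop in perplexity here reflects only the class merge, not any better modeling - which is exactly the concern.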
Joris
On 07/09/14 16:24, Siva Reddy Gangireddy wrote:
> Hi Joris,
>
> Use the count cut-offs like this.
>
> ngram-count -order 5 -text ptb.train.txt -lm templm -kndiscount
> -interpolate -unk -gt3min 1 -gt4min 1
> ngram -ppl ptb.test.txt -lm templm -order 5 -unk
>
> By default, SRILM applies higher count cut-offs (a minimum count of
> 2) to trigrams and above; -gt3min 1 -gt4min 1 turns those off.
>
> ---
> Siva
>
>
>
> On Wed, Jul 9, 2014 at 11:03 PM, Joris Pelemans
> <Joris.Pelemans at esat.kuleuven.be
> <mailto:Joris.Pelemans at esat.kuleuven.be>> wrote:
>
> Hi all,
>
> I'm trying to reproduce some reported N-gram perplexity results on
> the Penn Treebank with SRILM, but somehow my results always differ
> by a large margin. Since I will be interpolating with these models
> and comparing the interpolated model with others, I would really
> prefer to start from the same baseline :-).
>
> The data set I'm using is the one that comes with Mikolov's RNNLM
> toolkit and applies the same data preprocessing as many LM papers,
> including "Empirical Evaluation and Combination of Advanced
> Language Modeling Techniques". In that paper, Mikolov et al.
> report a KN5 perplexity of 141.2. It's not entirely clear (1)
> whether they ignore OOV words or simply use the <unk> probability,
> and (2) whether it's a back-off or an interpolated model, but I
> assume the latter, as it has repeatedly been reported as best.
> They do report using SRILM with no count cut-offs.
>
> I have tried building the same model in many ways:
>
> *regular:* ngram-count -order 5 -text data/penn/ptb.train.txt -lm
> models/ptb.train_5-gram_kn.arpa2 -kndiscount -interpolate
> *open vocab:* ngram-count -order 5 -text data/penn/ptb.train.txt
> -lm models/ptb.train_5-gram_kn.arpa3 -kndiscount -interpolate -unk
> *no sentence markers:* ngram-count -order 5 -text
> data/penn/ptb.train.txt -lm models/ptb.train_5-gram_kn.arpa4
> -kndiscount -interpolate -no-sos -no-eos
> *open vocab + no sentence markers:* ngram-count -order 5 -text
> data/penn/ptb.train.txt -lm models/ptb.train_5-gram_kn.arpa5
> -kndiscount -interpolate -unk -no-sos -no-eos
> *back-off (just in case):* ngram-count -order 5 -text
> data/penn/ptb.train.txt -lm models/ptb.train_5-gram_kn.arpa5
> -kndiscount -unk
>
> None of them, however, gives me a perplexity of 141.2:
>
> <jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt
> -lm models/ptb.train_5-gram_kn.arpa2 -order 5
> file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 4794 OOVs
> 0 zeroprobs, logprob= -172723 ppl= 167.794 ppl1= 217.791
>
> <jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt
> -lm models/ptb.train_5-gram_kn.arpa3 -order 5 -unk
> file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 0 OOVs
> 0 zeroprobs, logprob= -178859 ppl= 147.852 ppl1= 187.743
>
> <jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt
> -lm models/ptb.train_5-gram_kn.arpa4 -order 5
> file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 4794 OOVs
> 0 zeroprobs, logprob= -179705 ppl= 206.4 ppl1= 270.74
>
> <jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt
> -lm models/ptb.train_5-gram_kn.arpa5 -order 5 -unk
> file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 0 OOVs
> 0 zeroprobs, logprob= -186444 ppl= 182.746 ppl1= 234.414
>
> <jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt
> -lm models/ptb.train_5-gram_kn.arpa5 -order 5 -unk
> file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 0 OOVs
> 0 zeroprobs, logprob= -181381 ppl= 158.645 ppl1= 202.127
>
> So... what am I missing here? 147.852 is close, but still not
> quite 141.2.
>
> Joris
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com <mailto:SRILM-User at speech.sri.com>
> http://www.speech.sri.com/mailman/listinfo/srilm-user