[SRILM User List] reproduce Penn Treebank KN5 results
Siva Reddy Gangireddy
s.gangireddy at sms.ed.ac.uk
Wed Jul 9 07:24:15 PDT 2014
Hi Joris,
Set the count cut-offs explicitly, like this:
ngram-count -order 5 -text ptb.train.txt -lm templm -kndiscount \
    -interpolate -unk -gt3min 1 -gt4min 1
ngram -ppl ptb.test.txt -lm templm -order 5 -unk
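By default SRILM uses a minimum count of 2 for trigrams and higher orders
(-gt3min, -gt4min, ...), whereas the paper reports using no count cut-offs.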
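When comparing runs with and without -unk, it can help to recompute the perplexity from the summary line that ngram -ppl prints, because the number of scored tokens changes with the OOV count. The sketch below assumes the usual SRILM definitions, ppl = 10^(-logprob / (words - OOVs - zeroprobs + sentences)) and ppl1 = 10^(-logprob / (words - OOVs - zeroprobs)). The function name srilm_ppl is made up for illustration and is not part of SRILM.

```shell
#!/bin/sh
# Hypothetical helper (not an SRILM tool): recompute ppl and ppl1 from the
# counts and logprob reported by "ngram -ppl".
# Usage: srilm_ppl SENTENCES WORDS OOVS ZEROPROBS LOGPROB
srilm_ppl() {
    awk -v s="$1" -v w="$2" -v o="$3" -v z="$4" -v lp="$5" 'BEGIN {
        scored = w - o - z                           # tokens that received a probability
        ppl  = exp(-lp / (scored + s) * log(10))     # counts one </s> per sentence
        ppl1 = exp(-lp / scored * log(10))           # excludes the </s> tokens
        printf "%.1f %.1f\n", ppl, ppl1
    }'
}

# Closed-vocabulary run from the quoted message (4794 OOVs are skipped)
srilm_ppl 3761 78669 4794 0 -172723
# Open-vocabulary run with -unk (0 OOVs, so every token is scored)
srilm_ppl 3761 78669 0 0 -178859
```

The results should land close to the 167.794 and 147.852 reported in the quoted message; small differences are expected because the printed logprob is rounded. Note that the two runs score different numbers of tokens, so their perplexities are not directly comparable.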
---
Siva
On Wed, Jul 9, 2014 at 11:03 PM, Joris Pelemans <
Joris.Pelemans at esat.kuleuven.be> wrote:
> Hi all,
>
> I'm trying to reproduce some reported N-gram perplexity results on the
> Penn Treebank with SRILM, but somehow my results are always different by a
> large degree. Since I will be interpolating with these models and comparing
> the interpolated model with others, I would really prefer to start on the
> same level :-).
>
> The data set I'm using is the one that comes with Mikolov's RNNLM toolkit
> and applies the same processing of data as used in many LM papers,
> including "Empirical Evaluation and Combination of Advanced Language
> Modeling Techniques". In that paper, Mikolov et al. report a KN5 perplexity
> of 141.2. It's not entirely clear (1) whether they ignore OOV words or
> simply use the <unk> probability; and (2) whether it's a back-off or
> interpolated model, but I assume the latter as this has been reported as
> best many times. They do report using SRILM and no count cut-offs.
>
> I have tried building the same model in many ways:
>
> *regular:* ngram-count -order 5 -text data/penn/ptb.train.txt -lm
> models/ptb.train_5-gram_kn.arpa2 -kndiscount -interpolate
> *open vocab:* ngram-count -order 5 -text data/penn/ptb.train.txt -lm
> models/ptb.train_5-gram_kn.arpa3 -kndiscount -interpolate -unk
> *no sentence markers:* ngram-count -order 5 -text data/penn/ptb.train.txt
> -lm models/ptb.train_5-gram_kn.arpa4 -kndiscount -interpolate -no-sos
> -no-eos
> *open vocab + no sentence markers:* ngram-count -order 5 -text
> data/penn/ptb.train.txt -lm models/ptb.train_5-gram_kn.arpa5 -kndiscount
> -interpolate -unk -no-sos -no-eos
> *back-off (just in case):* ngram-count -order 5 -text
> data/penn/ptb.train.txt -lm models/ptb.train_5-gram_kn.arpa5 -kndiscount
> -unk
>
> None of them, however, gives me a perplexity of 141.2:
>
> <jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt -lm
> models/ptb.train_5-gram_kn.arpa2 -order 5
> file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 4794 OOVs
> 0 zeroprobs, logprob= -172723 ppl= 167.794 ppl1= 217.791
>
> <jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt -lm
> models/ptb.train_5-gram_kn.arpa3 -order 5 -unk
> file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 0 OOVs
> 0 zeroprobs, logprob= -178859 ppl= 147.852 ppl1= 187.743
>
> <jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt -lm
> models/ptb.train_5-gram_kn.arpa4 -order 5
> file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 4794 OOVs
> 0 zeroprobs, logprob= -179705 ppl= 206.4 ppl1= 270.74
>
> <jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt -lm
> models/ptb.train_5-gram_kn.arpa5 -order 5 -unk
> file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 0 OOVs
> 0 zeroprobs, logprob= -186444 ppl= 182.746 ppl1= 234.414
>
> <jpeleman at spchcl23:~/exp/025> ngram -ppl data/penn/ptb.test.txt -lm
> models/ptb.train_5-gram_kn.arpa5 -order 5 -unk
> file data/penn/ptb.test.txt: 3761 sentences, 78669 words, 0 OOVs
> 0 zeroprobs, logprob= -181381 ppl= 158.645 ppl1= 202.127
>
> So... what am I missing here? 147.852 is close, but still not quite 141.2.
>
> Joris