[SRILM User List] Google 1B Word Language Modeling Benchmark
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Dec 12 14:07:08 PST 2013
Ciprian Chelba asked me to forward the following information about a
recently launched initiative in large-scale LM benchmarking. More
information at
https://code.google.com/p/1-billion-word-language-modeling-benchmark/ .
Andreas
_________________________________________________________________________________________________________
Here is a brief description of the project.
"The purpose of the project is to make available a standard training and
test setup for language modeling experiments.
The training/held-out data was produced from a download at http://statmt.org/
using a combination of Bash shell and Perl scripts distributed with the
project.
This means that results on this data set are reproducible by the research
community at large.
Besides the scripts needed to rebuild the training/held-out data, the project
also makes available log-probability values for each word in each of ten
held-out data sets, for each of the following baseline models (a short
perplexity sketch follows the list):
* unpruned Katz (1.1B n-grams),
* pruned Katz (~15M n-grams),
* unpruned Interpolated Kneser-Ney (1.1B n-grams),
* pruned Interpolated Kneser-Ney (~15M n-grams).
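As a quick illustration of how the released per-word log-probability values
can be used, here is a minimal Python sketch that computes perplexity from a
file of per-word scores. The input layout (one log-probability per line) and
the base-10 assumption are hypothetical; check the project page for the
actual format of the released files.

#!/usr/bin/env python
# Minimal sketch: compute perplexity from per-word log-probabilities.
# Assumes a hypothetical input format of one base-10 log-probability
# per line (one line per held-out word); the actual format of the
# released files may differ -- see the project page.
import sys

def perplexity(path):
    total_logprob = 0.0
    n_words = 0
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            total_logprob += float(line)  # log10 P(w_i | history)
            n_words += 1
    # PPL = 10 ** (-(1/N) * sum_i log10 P(w_i | history))
    return 10.0 ** (-total_logprob / n_words)

if __name__ == "__main__":
    print("perplexity: %.2f" % perplexity(sys.argv[1]))

Invoked as, e.g., "python perplexity.py <logprob-file>", this recovers the
usual corpus-level perplexity figure from the per-word scores.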
arXiv paper: http://arxiv.org/abs/1312.3005
Happy benchmarking!"
--
-Ciprian