[SRILM User List] Google 1B Word Language Modeling Benchmark

Andreas Stolcke stolcke at icsi.berkeley.edu
Thu Dec 12 14:07:08 PST 2013


Ciprian Chelba asked me to forward the following information about a 
recently launched initiative in large-scale LM benchmarking.  More 
information at 
https://code.google.com/p/1-billion-word-language-modeling-benchmark/ .

Andreas

_________________________________________________________________________________________________________
Here is a brief description of the project.

"The purpose of the project is to make available a standard training and 
test setup for language modeling experiments.

The training/held-out data was produced from a download at statmt.org 
using a combination of Bash shell and Perl scripts distributed with 
the project.

This also means that your results on this data set are reproducible by 
the research community at large.
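
For readers who want to rebuild the data themselves, a minimal sketch 
of the flow. The tarball URL is an assumption (the WMT11 monolingual 
news data), and rebuild.sh is a placeholder for the project's actual 
Bash/Perl driver scripts, so check the project page for the real entry 
points:

    # Hypothetical rebuild sketch; the URL and script name below are
    # assumptions for illustration, not taken from the project page.
    wget http://statmt.org/wmt11/training-monolingual.tgz
    tar -xzf training-monolingual.tgz
    ./rebuild.sh training-monolingual/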

Besides the scripts needed to rebuild the training/held-out data, the 
project also makes available log-probability values for each word in 
each of ten held-out data sets, for each of the following baseline 
models (a scoring sketch follows the list):

  * unpruned Katz (1.1B n-grams),
  * pruned Katz (~15M n-grams),
  * unpruned Interpolated Kneser-Ney (1.1B n-grams),
  * pruned Interpolated Kneser-Ney (~15M n-grams).
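
For SRILM users who want per-word scores in the same spirit, the ngram 
tool can print log10 word probabilities, and a one-liner can turn such 
scores into a perplexity. The model and file names below are 
placeholders, and the one-score-per-line input format is an assumption, 
not the project's released format:

    # Score held-out text with an existing LM; -debug 2 prints a log10
    # probability for every word (model.lm and heldout.txt are
    # placeholder names).
    ngram -order 5 -lm model.lm -ppl heldout.txt -debug 2

    # Aggregate a file of per-word log10 probabilities (one value per
    # line, assumed format) into a perplexity: ppl = 10^(-mean log10 p).
    awk '{ s += $1; n++ } END { if (n) printf "%d words, ppl = %.2f\n", n, 10^(-s/n) }' heldout.logprobs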

arXiv paper: http://arxiv.org/abs/1312.3005

Happy benchmarking!"

-- 
-Ciprian


