[SRILM User List] question about using the Google Web N-gram corpus to build an LM

Andreas Stolcke stolcke at icsi.berkeley.edu
Thu Aug 15 14:15:14 PDT 2013


On 8/15/2013 12:15 AM, HU Rile wrote:
> Hi,
> I would like to build an LM using the Google Web 1T corpus. And I 
> followed the steps on 
> http://www-speech.sri.com/projects/srilm/manpages/srilm-faq.7.html. 
> But when I used ngram-count to estimate the mixture weights, the 
> program would not run and gave the response "google.countlm.0: line 22: 
> reached EOF before \end\
> format error in init-lm file".
> I tried to add \end\ to the end of google.countlm.0, but it did not 
> work.
> Here is the content of my google.countlm.0:
> order 3
> vocabsize 13588391
> totalcount 1024908267229
> countmodulus 40
> mixweights 15
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
>  0.5 0.5 0.5
> google-counts /home/hurile/googleweb1T/googleLM/
>
> Could someone please tell me how I can solve the problem? Thanks a lot!
>
> Rile Hu
>
You probably forgot the -count-lm option. Without it, ngram-count will 
try to interpret the -init-lm file as a standard N-gram LM file (where a 
closing \end\ line is expected) rather than as a count-LM file.
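
For illustration, here is a minimal Python sketch of how the corrected
ngram-count command line might be assembled. Only google.countlm.0 comes
from the message above; the names tune.text, google.vocab and
google.countlm are placeholders assumed for the example, and the helper
count_lm_command is hypothetical, not part of SRILM.

```python
import shlex


def count_lm_command(init_lm, tune_text, vocab, out_lm, order=3):
    """Build an ngram-count argument list for estimating count-LM mixture weights.

    All file names passed in are the caller's choice; nothing here is run.
    """
    return [
        "ngram-count",
        "-order", str(order),
        "-count-lm",           # read/write count-LM files instead of standard N-gram LMs
        "-text", tune_text,    # held-out text used to tune the mixture weights (assumed name)
        "-vocab", vocab,       # vocabulary file (assumed name)
        "-init-lm", init_lm,   # initial count-LM with starting weights
        "-lm", out_lm,         # output count-LM with estimated weights (assumed name)
    ]


cmd = count_lm_command(
    init_lm="google.countlm.0",
    tune_text="tune.text",
    vocab="google.vocab",
    out_lm="google.countlm",
)

# Print a shell-ready command line; pass `cmd` to subprocess.run to execute it.
print(shlex.join(cmd))
```

The point is simply that -count-lm has to appear alongside -init-lm, so
that the initial file is parsed as a count-LM and no \end\ line is looked
for.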

Andreas
