Google Web N-gram

Andreas Stolcke stolcke at speech.sri.com
Mon Jun 15 11:34:03 PDT 2009


Elias Majic wrote:
> Hello,
>
> First off, to save you from having to read all of the below: suppose I used 
> make-google-ngrams to store a small text corpus's N-gram counts on 
> disk in Google's format.  How do I then convert this to ARPA format 
> with SRILM?
You don't.  There is no reason to convert a standard N-gram count file 
into the Google format for building an ARPA LM.
Converting the counts into a different format won't help you deal with 
any memory issues.
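For a small corpus like the one described, the ordinary SRILM pipeline goes straight from text (or a plain count file) to an ARPA model, with no Google-format step in between. A minimal sketch, with hypothetical file names (the options are documented in the ngram-count man page):

```shell
# Count N-grams from raw text and write a standard SRILM count file
# (hypothetical file names throughout):
ngram-count -order 5 -text corpus.txt -write corpus.counts

# Estimate a backoff LM from those counts and write it in ARPA format.
# (Add a discounting option such as -kndiscount -interpolate if the
# counts are large enough to support it.)
ngram-count -order 5 -read corpus.counts -lm model.arpa
```

The resulting model.arpa can then be queried directly, e.g. with ngram -lm model.arpa -ppl test.txt.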
SRILM is currently just not set up to estimate ARPA LMs of the size 
implied by the Google corpus.
That's why we created the count-LM approach, which can make use of the 
Google N-gram files directly.
The estimation process is described in the FAQ, as you know.
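A count-LM is trained and applied with the ngram tool rather than converted to ARPA. A hedged sketch, assuming a count-LM parameter file google.countlm that already points at the Google count directory (file names here are hypothetical; -count-lm, -em-iters, -ppl, and -write-lm are documented in the ngram man page):

```shell
# Tune the count-LM's interpolation weights by EM on held-out text,
# writing a new parameter file (NOT an ARPA file):
ngram -order 5 -count-lm -lm google.countlm -ppl heldout.txt \
      -em-iters 50 -write-lm google.countlm.tuned

# Apply the tuned count-LM, e.g. to compute test-set perplexity:
ngram -order 5 -count-lm -lm google.countlm.tuned -ppl test.txt
```

Note that -write-lm applied to a count-LM writes back a count-LM parameter file, which is consistent with attempt 1 in the quoted message below reproducing google.countlm rather than an ARPA model.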

If you want to build a very large backoff LM, there are a few other LM 
tools out there that are explicitly targeted at large data sets.  Try 
googling "MSRLM" and "IRSTLM".  I doubt that even if you were able to 
build a traditional ARPA LM from all the Google N-grams it would do you 
much good -- it would take way too long to load into memory, even if 
only a subset were used.  That's why MSRLM, for example, uses a 
server-based approach.

Andreas

>
> I have read the Google Web N-gram section in the FAQ, I read all the 
> emails with the search term "google" in them, and I read all the relevant 
> man pages, as well as looking at the relevant run-tests, without success.
>
> My goal is to make an ARPA-format language model from the N-gram 
> counts inside the Google Web N-gram corpus.  I realize it's too large 
> to load into memory, as discussed in the documentation, so, as one of 
> the emails on the list suggested, I pruned out most of the junk and 
> non-dictionary words, merged different cases, and fixed the config 
> files.  I have now reduced the data quite significantly, but am unable to 
> figure out how to convert it to ARPA format.  Below is what I tried:
>
> 1. ngram -order 5 -count-lm -lm google.countlm -write-lm arpaLM
>
> This did not work.  It just produced a duplicate of google.countlm.
>
> 2. I noticed in the man pages that using the option -expand-classes 
> forced the output to be a single N-gram model in ARPA format.  So I tried:
> ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 
> -write-lm arpaLM
> Nothing happened except the output:
> HMM, NgramCountLM, AdaptiveMix, Decipher, tagged, factored, DF, hidden 
> N-gram, hidden-S, class N-gram, skip N-gram and stop-word N-gram 
> models are mutually exclusive
>
> 3. I thought maybe using -mix-lm would result in an ARPA model, as the 
> man pages also say this occurs with -mix-lm.  I realized this was 
> unlikely to work, since I would be combining the same LM with itself, 
> but tried regardless:
> ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -mix-lm 
> google.countlm -write-lm arpaLM
> The output was the same as google.countlm.
>
> I tried other things, like using ngram-count and running the lm-scripts, 
> but no dice.  One of the relevant posts from the archive is linked below:
>
> http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-April/8.html
> The URL above mentions:
>
> >> Could you give me an example about building a google 3-gram LM file,
> >> please?
> >
> > Again, this will require using the option with some tricks
> > that are not documented as yet. Please be patient (or read all the
> > manual pages carefully to figure it out yourself.)
>
> Has any documentation been made available regarding this?  Did the trick 
> involve using -mix-lm or -expand-classes to force ARPA format?
>
> I figure worst case I do it manually but am sure there is something in 
> SRILM that I am missing.
>
> Thanks
> Elias




More information about the SRILM-User mailing list