Google Web N-gram
Andreas Stolcke
stolcke at speech.sri.com
Mon Jun 15 11:34:03 PDT 2009
Elias Majic wrote:
> Hello,
>
> First off, to save you from having to read the below, suppose I used
> make-google-ngrams to store a small corpus of text's N-gram counts on
> disk in Google's format. How do I then convert this to ARPA format
> with SRILM?
You don't. There is no reason to convert a standard ngram count file
into Google format for building an ARPA LM.
Converting the counts into a different format won't help you deal with
any memory issues.
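For a corpus that fits in memory, the standard route (a sketch only;
the file names below are placeholders) goes from plain text to counts
to an ARPA LM without any Google-format step:

    # count 5-grams from raw text, then estimate a smoothed ARPA LM
    ngram-count -order 5 -text corpus.txt -write counts.txt
    ngram-count -order 5 -read counts.txt -kndiscount -interpolate -lm model.arpa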
SRILM is currently just not set up to estimate ARPA LMs of the size
implied by the Google corpus.
That's why we created the count-LM approach, which can make use of the
Google ngram files directly.
The estimation process is described in the FAQ, as you know.
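In rough terms (test.txt here is just a placeholder), a count-LM is
meant to be applied directly, e.g. for perplexity computation:

    ngram -order 5 -count-lm -lm google.countlm -ppl test.txt

rather than being converted to a backoff model, which is why -write-lm
simply echoes the count-LM specification file.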
If you want to build very large backoff LMs there are a few other LM
tools out there that are explicitly targeted at large data sets. Try
googling "MSRLM" and "IRSTLM". I doubt that even if you were able to
build a traditional ARPA LM from all the google ngrams it would do you
much good -- it would take way too long to load into memory, even if
only a subset were used. That's why MSRLM, for example, uses a
server-based approach.
Andreas
>
> I have read the Google Web N-gram section in the FAQ, read all the
> emails containing the search term "google", and read all the relevant
> man pages, as well as looked at the relevant run-tests, without success.
>
> My goal is to make an ARPA-format language model from the N-gram
> counts inside the Google Web N-gram corpus. I realize it's too large
> to load into memory, as discussed in the documentation, so, as one of
> the emails on the list suggested, I pruned out most of the junk and
> non-dictionary words, merged different cases, and fixed the config
> files. I have now reduced the data quite significantly, but am unable
> to figure out how to convert it to ARPA format. Below is what I tried:
>
> 1. ngram -order 5 -count-lm -lm google.countlm -write-lm arpaLM
>
> This did not work. It just produced a duplicate of google.countlm.
>
> 2. I noticed in the man pages that the -expand-classes option
> forces the output to be a single ngram model in ARPA format. So I tried:
> ngram -order 5 -count-lm -lm google.countlm -expand-classes 5
> -write-lm arpaLM
> Nothing happened but the output:
> HMM, NgramCountLM, AdaptiveMix, Decipher, tagged, factored, DF, hidden
> N-gram, hidden-S, class N-gram, skip N-gram and stop-word N-gram
> models are mutually exclusive
>
> 3. I thought maybe using -mix-lm would result in an ARPA model, as
> the man pages also say this occurs with -mix-lm. I realized this was
> unlikely to work since I am combining the same LMs, but tried
> regardless.
> ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -mix-lm
> google.countlm -write-lm arpaLM
> Output was the same as google.countlm
>
> I tried other things, like using ngram-count and running the
> lm-scripts, but no dice. One of the relevant posts is linked below:
>
> http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-April/8.html
> The URL above mentions:
>
> >> Could you give me an example about building a google 3-gram LM
> >> file, please?
> >
> > Again, this will require using the option with some tricks
> > that are not documented
> > as yet. Please be patient (or read all the manual pages carefully
> > to figure it out yourself.)
>
> Has any documentation been written regarding this? Did the trick
> involve using -mix-lm or -expand-classes to force ARPA format?
>
> I figure worst case I do it manually but am sure there is something in
> SRILM that I am missing.
>
> Thanks
> Elias