Google Web N-gram

Elias Majic elias.majic at gmail.com
Sat Jun 13 11:42:00 PDT 2009


Hello,

First off, to save you from having to read all of the below: suppose I used
make-google-ngrams to store a small text corpus's N-gram counts on disk in
Google's format.  How do I then convert this to ARPA format with SRILM?
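For context, the way I generated the Google-format counts from my small corpus was along these lines (corpus.txt and the directory name are placeholders, and the per_file value is just a guess at a sensible split size):

```shell
# Count 5-grams from the corpus; -sort gives the lexicographically
# sorted count stream that make-google-ngrams expects on its input
ngram-count -order 5 -text corpus.txt -sort -write - | gzip > counts.txt.gz

# Split the merged counts into the Google directory layout (1gms/, 2gms/, ...)
gunzip -c counts.txt.gz | make-google-ngrams dir=google-data per_file=10000000
```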

I have read the Google Web N-gram section in the FAQ, read all the list
emails matching the search term "google", and read all the relevant man
pages, as well as looking at the relevant run-tests, without success.

My goal is to build an ARPA-format language model from the N-gram counts
inside the Google Web N-gram corpus.  I realize it's too large to load into
memory, as discussed in the documentation, so, as one of the emails on the
list suggested, I pruned out most of the junk and non-dictionary words,
merged different cases, and fixed the config files.  I have now reduced the
data quite significantly, but I am unable to figure out how to convert it to
ARPA format.  Below is what I tried:

1. ngram -order 5 -count-lm -lm google.countlm -write-lm arpaLM

This did not work: it just wrote out a duplicate of google.countlm.

2. I noticed in the man pages that the -expand-classes option forces the
output to be a single N-gram model in ARPA format, so I tried:
ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -write-lm
arpaLM
Nothing happened except for this message:
HMM, NgramCountLM, AdaptiveMix, Decipher, tagged, factored, DF, hidden
N-gram, hidden-S, class N-gram, skip N-gram and stop-word N-gram models are
mutually exclusive

3. I thought maybe -mix-lm would produce an ARPA model, since the man pages
say this also happens with -mix-lm. I realized this was unlikely to work, as
I am interpolating the same LM with itself, but tried regardless:
ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -mix-lm
google.countlm -write-lm arpaLM
The output was again identical to google.countlm.
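For reference, google.countlm is a count-LM parameter file of the kind described in ngram(1); mine looks roughly like the following (the sizes, mixture weights, and path below are placeholders, and I may not have every keyword exactly right):

```
order 5
vocabsize 13588391
totalcount 1024908267229
countmodulus 40
mixweights 3
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
 0.5 0.5 0.5 0.5 0.5
google-counts /path/to/google-data
```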

I tried other things, such as using ngram-count and running the lm-scripts,
but no dice.  One of the relevant posts is this one:

http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-April/8.html
The URL above mentions:
>> Could you give me an *example* about building google 3-gram LM file,
>> please?
>
> Again, this will require using the option with some tricks
> that are not documented as yet. Please be patient (or read all the
> manual pages carefully to figure it out yourself.)
Has any documentation been written on this since? Did the trick involve
using -mix-lm or -expand-classes to force ARPA format?

I figure that, worst case, I can do it manually, but I am sure there is
something in SRILM that I am missing.
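For completeness, the manual fallback I have in mind, given how small my pruned data now is, would be to merge the Google-format count files back into a single counts file and re-estimate a backoff model with ngram-count (the glob and file names here are placeholders for however the directory is actually laid out):

```shell
# Recombine the per-order Google-format count files (1gms/, 2gms/, ...)
gunzip -c google-data/*gms/*.gz > all.counts

# Re-estimate a backoff LM in ARPA format directly from the merged counts
ngram-count -order 5 -read all.counts -lm small.arpa
```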

Thanks
Elias


More information about the SRILM-User mailing list