Google Web N-gram
Elias Majic
elias.majic at gmail.com
Sat Jun 13 11:42:00 PDT 2009
Hello,
First off, to save you from having to read everything below: suppose I used
make-google-ngrams to store the N-gram counts of a small text corpus on disk
in Google's format. How do I then convert them to ARPA format with SRILM?
I have read the Google Web N-gram section of the FAQ, all the list emails
matching the search term "google", and the relevant man pages, and I have
looked at the relevant run-tests, all without success.
My goal is to build an ARPA-format language model from the N-gram counts
in the Google Web N-gram corpus. I realize it is too large to load into
memory, as discussed in the documentation, so, as one of the emails on the
list suggested, I pruned out most of the junk or non-dictionary words,
merged different cases, and fixed the config files. I have now reduced the
data quite significantly, but I am unable to figure out how to convert it to
ARPA format. Below is what I tried:
1. ngram -order 5 -count-lm -lm google.countlm -write-lm arpaLM
This did not work. It just produced a duplicate of google.countlm.
2. I noticed in the man pages that the -expand-classes option forces
the output to be a single N-gram model in ARPA format. So I tried:
ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -write-lm arpaLM
Nothing happened except for this output:
HMM, NgramCountLM, AdaptiveMix, Decipher, tagged, factored, DF, hidden
N-gram, hidden-S, class N-gram, skip N-gram and stop-word N-gram models are
mutually exclusive
3. I thought maybe using -mix-lm would result in an ARPA model, as the man
pages say this also happens with -mix-lm. I realize this was unlikely
to work, since I am combining the same LM with itself, but I tried regardless:
ngram -order 5 -count-lm -lm google.countlm -expand-classes 5 -mix-lm google.countlm -write-lm arpaLM
The output was the same as google.countlm.
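For concreteness, the pruning and case-merging step described earlier can be sketched roughly as follows. This is an illustrative Python sketch, not the script actually used: the function name prune_and_merge, the toy vocabulary, and the sample lines are all hypothetical, and the input is assumed to be lines of the form "w1 w2 ... wN<TAB>count", as in the Google N-gram files.

```python
from collections import Counter

def prune_and_merge(lines, vocab):
    """Lowercase each n-gram, drop n-grams containing any word outside
    vocab, and sum the counts of n-grams that become identical."""
    merged = Counter()
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        # Assumed layout: space-separated words, a tab, then the count.
        ngram, _, count = line.rpartition("\t")
        words = ngram.lower().split(" ")
        if all(w in vocab for w in words):
            merged[" ".join(words)] += int(count)
    return merged

# Toy data; a real vocabulary would come from a dictionary word list.
vocab = {"the", "cat", "sat"}
sample = ["The cat\t10", "the cat\t5", "the xqzt\t7", "cat sat\t3"]
for ngram, count in sorted(prune_and_merge(sample, vocab).items()):
    print(f"{ngram}\t{count}")
```

Note that lowercasing also affects special tokens such as sentence-boundary markers, so the vocabulary would need to list them in lowercase form if they are to be kept.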
I tried other things, like using ngram-count and running the lm-scripts, but
no dice. One relevant post from the list archive is linked below:
http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/2007-April/8.html
The post at that URL says:
>> Could you give me an *example* about bulilding google 3-gram LM file
>> ,please?
>>
>Again, this will require using the option with some tricks
>that are not documents
>as yet. Please be patient (or read all the manual pages carefully to
>figure it our yourself.)
Has any documentation been written about this? Did the trick involve using
-mix-lm or -expand-classes to force ARPA format?
Worst case, I can do it manually, but I am sure there is something in SRILM
that I am missing.
Thanks
Elias