[SRILM User List] Generating sentences from a Google N-Gram LM

Benjamin Lambert benlambert at cmu.edu
Thu Feb 18 20:00:48 PST 2010


Hi there,

I'm pretty new to the SRI LM toolkit.  I couldn't figure out how to search the mailing list archives, so forgive me if this has come up before.

I'd like to *generate* random sentences using a LM based on Google n-grams (GNG).  I tried following the directions in the FAQ on using GNG.  I don't have any particular corpus in mind, so I used the vocab from the WSJ HUB4 dataset (it was all I had handy; it's also all CAPS, so I made it lowercase in case that would help).  That vocab file is about 15k words.
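For reference, the lowercasing step could be done roughly as follows. This is only a minimal sketch: it assumes the vocab file has one word per line, the function names are made up for illustration, and the input name wsj.vocab is hypothetical (only wsj-lc.vocab appears in the command below). Lowercasing can merge entries that differed only by case, so duplicates are dropped.

```python
def lowercase_vocab(words):
    """Lowercase each word, dropping blank entries and duplicates.

    First-seen order is kept so the output stays easy to compare with the input.
    """
    seen = set()
    result = []
    for word in words:
        lowered = word.strip().lower()
        if lowered and lowered not in seen:
            seen.add(lowered)
            result.append(lowered)
    return result


def convert_vocab_file(src_path, dst_path):
    """Read a one-word-per-line vocab file and write a lowercased copy."""
    with open(src_path, encoding="utf-8") as src:
        words = lowercase_vocab(src)
    with open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("\n".join(words) + "\n")
```

Calling convert_vocab_file("wsj.vocab", "wsj-lc.vocab") would then produce the lowercased file passed to -vocab.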

This is the command I'm using at the end to generate:
ngram -memuse -debug 3 -order 3 -count-lm -lm google.countlm -gen 1 -vocab wsj-lc.vocab -limit-vocab -vocab-aliases google.alias

My questions are:
1) Am I on the right track here?

2) After launching the 'ngram' binary, it repeatedly prints:
"gunzip: stdout: Broken pipe
gunzip: stdout: Broken pipe
gunzip: stdout: Broken pipe"

Is that normal?  It seems to finish anyway.

3) It takes a very long time to generate a single sentence.  Is that expected?  (I imagine so, because of the file format.)  Would it be faster if it weren't unzipping the data?

4) When I finally do get a sentence generated, it's *very* long and has *many* <unk>'s.  For example, the command above gives me a sentence with 50,004 words.  In fact, every sentence generated seems to be 50,004 words long, which doesn't look right to me.  Any idea what's happening here?  Maybe it's not handling the begin- and end-of-sentence markers properly.  I've pasted the beginning of the 50k-word sentence generated by the command above (that is, with order 3) below.

Thank you,
Ben

costs such generic technology now identified thing costs and shop style valve aged do ever must to science and providing if po address own do gets provide offering provide if current corresponding to and and to and <unk> space and to units to finance to placement and arrangement and to <unk> <unk> <unk> now and and registrations <unk> <unk> prior names novel and <unk> <unk> if accepted she array <unk> to and wide and under went <unk> court corresponding to secured <unk> performance and <unk> <unk> <unk> now linked to and me and profits to do completing start man yard cell and and <unk> sanitary <unk> crimes <unk> manage statement valve increasing <unk> <unk> <unk> such <unk> if law starting to error to many during me <unk> ever costs to address to to many provide address leaf environment to and she to press <unk> <unk> performance man many increasing climbed <unk> <unk> <unk> knows which reporting to compete under to <unk> <unk> stores microsoft wild <unk> <unk> <unk> <unk> to me do and and own individual front son such me <unk> <unk> murder problem and moisture and during and been excluded do to prepare old forward and now came math distributor inadequate <unk> to want to hit store <unk> solid files equivalent first tricks and session own prefer and do and won managed and to block accomplished do date to to start to <unk> do <unk> do to merlin agree to store own to worry <unk> many doctors programming level and rate provide <unk> <unk> if buyback <unk> <unk> to <unk> never ever own <unk> <unk> if <unk> yard rights to current attitude current court and array color rated first space under hall to many beginning to prefer and want to to knows cities fifth rapid custom describe gathering and level must chicken rights to me and pure <unk> <unk> thing if each if subsidiary elements <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> <unk> to to and and compete to easily now blew to if currently now to route space and business and old medical individual now
which stores do want <unk> tax to word managed if to to <unk> <unk> <unk> persons to to <unk> <unk> <unk> <unk> trigger party inspiration if and to faith credit and probably prefer company word such and do and denied claim <unk> <unk> call everything bad and salaries law been stick to and <unk> first and came to <unk> <unk> ever packages constrained generic been to friends to do provide rapid do such <unk> <unk> costs originally shown to originally and me and won brands to to investigation such such first king to me do word bad and me <unk> <unk> working under share call and writers call <unk> <unk> <unk> word and became rights such everything me water under generic individual if now letters to to shown to court and to <unk> to <unk> <unk> <unk> <unk> <unk> <unk> knows to ever <unk> rights and man own to serious and <unk> activities prior to company clearly complaints <unk> shown under me old to do uncertain manage fees company currently <unk> now national to and <unk> additional water call business and <unk> attached <unk> under
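
To put numbers on output like the excerpt above, something along these lines could count the words in a generated sentence and how many of them are the unknown-word token. It is only a rough sketch: the function name is made up for illustration, tokens are assumed to be whitespace-separated, and the unknown-word token is assumed to be written as <unk>.

```python
def sentence_stats(sentence, unk="<unk>"):
    """Return the token count, unknown-token count, and unknown-token rate.

    Tokens are split on whitespace; `unk` is the assumed spelling of the
    unknown-word token in the generated text.
    """
    tokens = sentence.split()
    n_unk = sum(1 for token in tokens if token == unk)
    rate = n_unk / len(tokens) if tokens else 0.0
    return {"tokens": len(tokens), "unk": n_unk, "unk_rate": rate}
```

Running this on each generated line would show whether every sentence really comes out at the same length and how much of it is unknown words.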





--
Benjamin Lambert
Ph.D. Student of Computer Science
Carnegie Mellon University
www.cs.cmu.edu/~belamber
Mobile: 617-869-1844



