[SRILM User List] Restricting language model on n-gram types with at least n occurrences
claude.vividsky at gmail.com
Thu Aug 23 22:54:27 PDT 2018
This is the gt1 file:
mincount 1
maxcount 1
discount 1 1
Do I have to change only the mincount parameter from 1 to 2, or do maxcount and discount need to change as well?
Moreover, do I also have to change the file from the -gt2 option? At the moment it looks like this:
mincount 1
maxcount 6
discount 1 0.1023587004895416
discount 2 0.3477414330218069
discount 3 0.6012207785629259
discount 4 1
discount 5 0.3320872274143302
discount 6 1
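If only mincount has to change, I assume the edit to the -gt2 file would look roughly like the sketch below. It recreates the file shown above under a scratch name (example.gt2 and example.min2.gt2 are made-up names, not part of our workflow), rewrites the mincount line from 1 to 2, and leaves maxcount and the discount lines untouched. Whether leaving those untouched is correct is exactly what I am unsure about.

```shell
# Scratch copy of the -gt2 parameter file shown above.
# example.gt2 is a hypothetical name used only for this sketch.
cat > example.gt2 <<'EOF'
mincount 1
maxcount 6
discount 1 0.1023587004895416
discount 2 0.3477414330218069
discount 3 0.6012207785629259
discount 4 1
discount 5 0.3320872274143302
discount 6 1
EOF

# Rewrite only the mincount line; every other line is copied unchanged.
# example.min2.gt2 is likewise a hypothetical output name.
sed 's/^mincount .*/mincount 2/' example.gt2 > example.min2.gt2

# Show the edited parameter file.
cat example.min2.gt2
```

The edited file would then be passed to ngram-count with -gt2 in place of text-group.gt2 when building the language model.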
Thank you
Michael
> -----Original Message-----
> From: SRILM-User [mailto:srilm-user-bounces at speech.sri.com] On Behalf Of
> Andreas Stolcke
> Sent: Thursday, August 23, 2018 8:16 PM
> To: srilm-user at speech.sri.com
> Subject: Re: [SRILM User List] Restricting language model on n-gram types
> with at least n occurrences
>
> The problem is that the -gt2min option is overridden when reading the GT
> parameters from file (-gt1 option).
> Edit the "mincount" value in the file by hand, then run the ngram-count
> for LM estimation.
>
> Andreas
>
> On 8/23/2018 5:32 AM, claude.vividsky at gmail.com wrote:
> > Hello,
> >
> > we would like to build a language model in which only n-gram types with
> > at least 2 occurrences are considered. N-grams which occur only once are
> > to be discarded.
> >
> > We use the following workflow (example for bigrams)
> >
> > Count N-Grams for each text
> > ngram-count -order 2 -text text-group-001.txt -write text-group-001.count
> > ngram-count -order 2 -text text-group-002.txt -write text-group-002.count
> >
> > Merge N-Grams into a common file
> > ngram-merge -write text-group.count -- text-group-001.count \
> >     text-group-002.count
> >
> > Extract vocabulary and estimate GT parameters
> > ngram-count -sort -read text-group.count \
> >     -write-vocab text-group.vocab \
> >     -gt1 text-group.gt1 \
> >     -gt2 text-group.gt2
> >
> > Build language model with -gt2min 1
> > ngram-count -sort -order 2 -read text-group-001.count \
> >     -vocab text-group.vocab \
> >     -lm text-group-001-1.lm \
> >     -gt2min 1 \
> >     -gt1 text-group.gt1 -gt2 text-group.gt2
> >
> > Build language model with -gt2min 2
> > ngram-count -sort -order 2 -read text-group-001.count \
> >     -vocab text-group.vocab \
> >     -lm text-group-001-2.lm \
> >     -gt2min 2 \
> >     -gt1 text-group.gt1 -gt2 text-group.gt2
> >
> > Apply the language model
> > ngram -order 2 -lm text-group-001-1.lm -ppl text-group-002.txt -debug 0
> > ngram -order 2 -lm text-group-001-2.lm -ppl text-group-002.txt -debug 0
> >
> > The problem is that the resulting language models text-group-001-1.lm and
> > text-group-001-2.lm are the same. Hence, applying them to new texts
> > results in the same values.
> >
> > Both texts contain n-grams which occur once and others which occur
> > twice or more.
> >
> > What are we doing wrong?
> >
> > We appreciate your help!
> >
> > Claude
> >
> > _______________________________________________
> > SRILM-User site list
> > SRILM-User at speech.sri.com
> > http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user