[SRILM User List] Restricting language model on n-gram types with at least n occurrences

Thu Aug 23 22:54:27 PDT 2018

This is the gt1 file

	mincount 1
	maxcount 1
	discount 1 1

Do I have to change only the mincount parameter from 1 to 2? What about maxcount and discount?

Moreover, do I have to change the file from the -gt2 option too? At the moment it looks like this:

	mincount 1
	maxcount 6
	discount 1 0.1023587004895416
	discount 2 0.3477414330218069
	discount 3 0.6012207785629259
	discount 4 1
	discount 5 0.3320872274143302
	discount 6 1
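
One way to make such an edit without touching the other lines is sketched below. It assumes the bigram cutoff is the mincount line of the file passed via -gt2, and it leaves maxcount and the discount lines exactly as estimated; whether the discounts ought to be re-estimated for the higher cutoff is not settled here. The file names demo.gt2 and demo-min2.gt2 are hypothetical stand-ins so that the snippet runs on its own.

```shell
#!/bin/sh
set -eu

# Recreate the bigram parameter file quoted above under a hypothetical name.
cat > demo.gt2 <<'EOF'
mincount 1
maxcount 6
discount 1 0.1023587004895416
discount 2 0.3477414330218069
discount 3 0.6012207785629259
discount 4 1
discount 5 0.3320872274143302
discount 6 1
EOF

# Rewrite only the mincount line; every other line is printed unchanged.
awk -v min=2 '$1 == "mincount" { $2 = min } { print }' demo.gt2 > demo-min2.gt2

# Show the edited file.
cat demo-min2.gt2
```

The edited file would then take the place of text-group.gt2 after -gt2 in the ngram-count call that writes the LM.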

Thank you
Michael
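
To check whether an edited cutoff actually took effect, one option is to compare the bigram totals in the \data\ header of the two ARPA files written by -lm: with singleton bigrams excluded, the second total should be smaller. A rough sketch follows; the helper arpa_ngram_total, the demo file names, and the toy header values are made up for illustration and are not output from the models in this thread.

```shell
#!/bin/sh
set -eu

# Hypothetical helper: print the number of n-grams of order $1 listed in
# the \data\ header of the ARPA file $2.
arpa_ngram_total() {
    awk -v order="$1" '
        /-grams:/ { exit }                  # header ends at the first n-gram section
        $1 == "ngram" {
            split($2, kv, "=")              # "2=5" -> kv[1]="2", kv[2]="5"
            if (kv[1] == order) { print kv[2]; exit }
        }
    ' "$2"
}

# Toy ARPA headers (made-up numbers) standing in for the two real LM files.
cat > demo-min1.lm <<'EOF'

\data\
ngram 1=4
ngram 2=5

\1-grams:
EOF

cat > demo-min2.lm <<'EOF'

\data\
ngram 1=4
ngram 2=2

\1-grams:
EOF

n1=$(arpa_ngram_total 2 demo-min1.lm)
n2=$(arpa_ngram_total 2 demo-min2.lm)

if [ "$n1" = "$n2" ]; then
    echo "same bigram total ($n1): the cutoff was probably not applied"
else
    echo "bigram totals differ: $n1 vs $n2"
fi
```

With the real models, the two arguments would be text-group-001-1.lm and text-group-001-2.lm.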

> -----Original Message-----
> From: SRILM-User [mailto:srilm-user-bounces at speech.sri.com] On Behalf Of
> Andreas Stolcke
> Sent: Thursday, August 23, 2018 8:16 PM
> To: srilm-user at speech.sri.com
> Subject: Re: [SRILM User List] Restricting language model on n-gram types
> with at least n occurrences
> 
> The problem is that the -gt2min option is overridden when reading the GT
> parameters from file (-gt1 option).
> Edit the "mincount" value in the file by hand, then run the ngram-count
> for LM estimation.
> 
> Andreas
> 
> On 8/23/2018 5:32 AM, claude.vividsky at gmail.com wrote:
> > Hello,
> >
> > we would like to build a language model where only n-gram types with at
> > least 2 occurrences are considered. N-grams which occur only once are
> > to be discarded.
> >
> > We use the following workflow (example for bigrams)
> >
> > Count N-Grams for each text
> > 	ngram-count -order 2 -text text-group-001.txt -write text-group-001.count
> > 	ngram-count -order 2 -text text-group-002.txt -write text-group-002.count
> >
> > Merge N-Grams into a common file
> > 	ngram-merge -write text-group.count -- text-group-001.count text-group-002.count
> >
> > Extract vocabulary and estimate GT parameters
> > 	ngram-count -sort -read text-group.count \
> > 		-write-vocab text-group.vocab \
> > 		-gt1 text-group.gt1 \
> > 		-gt2 text-group.gt2
> >
> > Build language model with -gt2min 1
> > 	ngram-count -sort -order 2 -read text-group-001.count \
> > 		-vocab text-group.vocab \
> > 		-lm text-group-001-1.lm \
> > 		-gt2min 1 \
> > 		-gt1 text-group.gt1 -gt2 text-group.gt2
> >
> > Build language model with -gt2min 2
> > 	ngram-count -sort -order 2 -read text-group-001.count \
> > 		-vocab text-group.vocab \
> > 		-lm text-group-001-2.lm \
> > 		-gt2min 2 \
> > 		-gt1 text-group.gt1 -gt2 text-group.gt2
> >
> > Apply the language model
> > 	ngram -order 2 -lm text-group-001-1.lm -ppl text-group-002.txt -debug 0
> > 	ngram -order 2 -lm text-group-001-2.lm -ppl text-group-002.txt -debug 0
> >
> > The problem is that the resulting language models text-group-001-1.lm and
> > text-group-001-2.lm are the same. Hence applying them to new texts results
> > in the same values.
> >
> > Both texts contain n-grams which occur once and others which occur
> > twice or more.
> >
> > What are we doing wrong?
> >
> > We appreciate your help!
> >
> > Claude
> >
> > _______________________________________________
> > SRILM-User site list
> > SRILM-User at speech.sri.com
> > http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user
> >