[SRILM User List] Restricting language model on n-gram types with at least n occurrences

Andreas Stolcke stolcke at icsi.berkeley.edu
Fri Aug 24 02:13:16 PDT 2018


Just change the mincount value in all the parameter files you use for 
-gt1, -gt2, etc.
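
A minimal sketch of that edit, assuming the parameter files use the layout quoted below (one mincount line, one maxcount line, then discount lines). The helper name set_mincount is hypothetical and not part of SRILM; it rewrites only the mincount line and passes every other line through unchanged:

```shell
#!/bin/sh
# Hypothetical helper (not an SRILM tool): set the "mincount" value in one
# or more Good-Turing parameter files written by ngram-count -gt1/-gt2.
set_mincount() {
    new_min=$1
    shift
    for f in "$@"; do
        # Replace only the line whose first field is "mincount";
        # maxcount and discount lines are printed as they are.
        awk -v m="$new_min" \
            '$1 == "mincount" { print "mincount", m; next } { print }' \
            "$f" > "$f.tmp" || return 1
        mv "$f.tmp" "$f" || return 1
    done
}
```

For the files in this thread the call would be set_mincount 2 text-group.gt1 text-group.gt2, followed by re-running the ngram-count command that writes the LM.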

Andreas


On 8/23/2018 10:54 PM, claude.vividsky at gmail.com wrote:
> This is the gt1 file
>
> 	mincount 1
> 	maxcount 1
> 	discount 1 1
>
> Do I have to change only the mincount parameter from 1 to 2? What about maxcount and discount?
>
> Do I also have to change the file passed to the -gt2 option? At the moment it looks like this:
>
> 	mincount 1
> 	maxcount 6
> 	discount 1 0.1023587004895416
> 	discount 2 0.3477414330218069
> 	discount 3 0.6012207785629259
> 	discount 4 1
> 	discount 5 0.3320872274143302
> 	discount 6 1
>
> Thank you
> Michael
>
>> -----Original Message-----
>> From: SRILM-User [mailto:srilm-user-bounces at speech.sri.com] On Behalf Of
>> Andreas Stolcke
>> Sent: Thursday, August 23, 2018 8:16 PM
>> To: srilm-user at speech.sri.com
>> Subject: Re: [SRILM User List] Restricting language model on n-gram types
>> with at least n occurrences
>>
>> The problem is that the -gt2min option is overridden when the GT
>> parameters are read from a file (-gt1 option).
>> Edit the "mincount" value in the file by hand, then run ngram-count
>> for LM estimation.
>>
>> Andreas
>>
>> On 8/23/2018 5:32 AM, claude.vividsky at gmail.com wrote:
>>> Hello,
>>>
>>> we would like to build a language model that includes only n-gram types
>>> with at least 2 occurrences. N-grams that occur only once are to be
>>> discarded.
>>>
>>> We use the following workflow (example for bigrams)
>>>
>>> Count N-Grams for each text
>>> 	ngram-count -order 2 -text text-group-001.txt -write text-group-001.count
>>> 	ngram-count -order 2 -text text-group-002.txt -write text-group-002.count
>>>
>>> Merge N-Grams into a common file
>>> 	ngram-merge -write text-group.count -- text-group-001.count text-group-002.count
>>>
>>> Extract vocabulary and estimate GT parameters
>>> 	ngram-count -sort -read text-group.count \
>>> 		-write-vocab text-group.vocab \
>>> 		-gt1 text-group.gt1 \
>>> 		-gt2 text-group.gt2
>>>
>>> Build language model with -gt2min 1
>>> 	ngram-count -sort -order 2 -read text-group-001.count \
>>> 		-vocab text-group.vocab \
>>> 		-lm text-group-001-1.lm \
>>> 		-gt2min 1 \
>>> 		-gt1 text-group.gt1 -gt2 text-group.gt2
>>>
>>> Build language model with -gt2min 2
>>> 	ngram-count -sort -order 2 -read text-group-001.count \
>>> 		-vocab text-group.vocab \
>>> 		-lm text-group-001-2.lm \
>>> 		-gt2min 2 \
>>> 		-gt1 text-group.gt1 -gt2 text-group.gt2
>>>
>>> Apply the language model
>>> 	ngram -order 2 -lm text-group-001-1.lm -ppl text-group-002.txt -debug 0
>>> 	ngram -order 2 -lm text-group-001-2.lm -ppl text-group-002.txt -debug 0
>>>
>>> The problem is that the resulting language models text-group-001-1.lm and
>>> text-group-001-2.lm are identical. Hence applying them to new texts gives
>>> the same values.
>>>
>>> Both texts contain n-grams that occur once and others that occur
>>> twice or more.
>>>
>>> What are we doing wrong?
>>>
>>> We appreciate your help!
>>>
>>> Claude
>>>
>>> _______________________________________________
>>> SRILM-User site list
>>> SRILM-User at speech.sri.com
>>> http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user