[SRILM User List] Restricting language model on n-gram types with at least n occurrences

Andreas Stolcke stolcke at icsi.berkeley.edu
Thu Aug 23 11:16:04 PDT 2018


The problem is that the -gt2min option is overridden when the GT 
parameters are read from a file (the -gt1 and -gt2 options).
Edit the "mincount" value in the file by hand (here the -gt2 file, since 
-gt2min applies to bigrams), then run ngram-count for LM estimation.
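
A rough sketch of that hand edit, done with sed instead of a text editor.
It assumes the GT parameter file contains a line of the form "mincount N".
The helper name set_mincount and the output file name text-group-min2.gt2
are made up for illustration. The ngram-count call is left as a comment
because it needs SRILM and the files from this thread.

```shell
# Rewrite the "mincount" line of a GT parameter file into a new file.
# Assumption: the file has exactly one line starting with "mincount ".
set_mincount() {
    # $1 = input GT file, $2 = new minimum count, $3 = output GT file
    sed "s/^mincount[[:space:]].*/mincount $2/" "$1" > "$3"
}

# Intended use with the files from this thread (requires SRILM):
# set_mincount text-group.gt2 2 text-group-min2.gt2
# ngram-count -sort -order 2 -read text-group-001.count \
#     -vocab text-group.vocab \
#     -lm text-group-001-2.lm \
#     -gt1 text-group.gt1 -gt2 text-group-min2.gt2
```

Writing to a separate file keeps the originally estimated parameters
untouched, so the mincount 1 and mincount 2 models can be built side by side.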

Andreas

On 8/23/2018 5:32 AM, claude.vividsky at gmail.com wrote:
> Hello,
>
> we would like to build a language model in which only n-gram types with
> at least 2 occurrences are considered. N-grams which occur only once are
> to be discarded.
>
> We use the following workflow (example for bigrams)
>
> Count N-Grams for each text
> 	ngram-count -order 2 -text text-group-001.txt -write text-group-001.count
> 	ngram-count -order 2 -text text-group-002.txt -write text-group-002.count
>
> Merge N-Grams into a common file
> 	ngram-merge -write text-group.count -- text-group-001.count text-group-002.count
>
> Extract vocabulary and estimate GT parameters
> 	ngram-count -sort -read text-group.count \
> 		-write-vocab text-group.vocab \
> 		-gt1 text-group.gt1 \
> 		-gt2 text-group.gt2
>
> Build language model with -gt2min 1
> 	ngram-count -sort -order 2 -read text-group-001.count \
> 		-vocab text-group.vocab \
> 		-lm text-group-001-1.lm \
> 		-gt2min 1 \
> 		-gt1 text-group.gt1 -gt2 text-group.gt2
>
> Build language model with -gt2min 2
> 	ngram-count -sort -order 2 -read text-group-001.count \
> 		-vocab text-group.vocab \
> 		-lm text-group-001-2.lm \
> 		-gt2min 2 \
> 		-gt1 text-group.gt1 -gt2 text-group.gt2
>
> Apply the language model
> 	ngram -order 2 -lm text-group-001-1.lm -ppl text-group-002.txt -debug 0
> 	ngram -order 2 -lm text-group-001-2.lm -ppl text-group-002.txt -debug 0
>
> The problem is that the resulting language models text-group-001-1.lm and
> text-group-001-2.lm are the same. Hence applying them to new texts yields
> the same values.
>
> Both texts contain n-grams which occur once and others which occur
> twice or more.
>
> What are we doing wrong?
>
> We appreciate your help!
>
> Claude



