[SRILM User List] Restricting language model on n-gram types with at least n occurrences

Thu Aug 23 05:32:22 PDT 2018

Hello,

We would like to build a language model in which only n-gram types with at
least 2 occurrences are kept. N-grams that occur only once are to be
discarded.

We use the following workflow (shown here for bigrams):

Count N-Grams for each text
	ngram-count -order 2 -text text-group-001.txt \
		-write text-group-001.count
	ngram-count -order 2 -text text-group-002.txt \
		-write text-group-002.count

Merge N-Grams into a common file
	ngram-merge -write text-group.count -- \
		text-group-001.count text-group-002.count

Extract vocabulary and estimate GT parameters
	ngram-count -sort -read text-group.count \
		-write-vocab text-group.vocab \
		-gt1 text-group.gt1 \
		-gt2 text-group.gt2

Build language model with -gt2min 1
	ngram-count -sort -order 2 -read text-group-001.count \
		-vocab text-group.vocab \
		-lm text-group-001-1.lm \
		-gt2min 1 \
		-gt1 text-group.gt1 -gt2 text-group.gt2

Build language model with -gt2min 2
	ngram-count -sort -order 2 -read text-group-001.count \
		-vocab text-group.vocab \
		-lm text-group-001-2.lm \
		-gt2min 2 \
		-gt1 text-group.gt1 -gt2 text-group.gt2

Apply the language model
	ngram -order 2 -lm text-group-001-1.lm -ppl text-group-002.txt -debug 0
	ngram -order 2 -lm text-group-001-2.lm -ppl text-group-002.txt -debug 0

The problem is that the resulting language models text-group-001-1.lm and
text-group-001-2.lm are identical. Hence, applying them to new texts yields
the same perplexity values.
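
One way to confirm that the two models are identical is sketched below. The
helper name compare_lms is hypothetical (it is not part of SRILM), and the
sketch assumes the models are uncompressed ARPA text files whose header
contains a line of the form "ngram 2=<count>".

```shell
# Hypothetical helper (not part of SRILM): compare two ARPA-format LM files.
# Assumes uncompressed text files with a header line like "ngram 2=<count>".
compare_lms() {
    # Print the number of bigram entries declared in each file's header.
    for lm in "$1" "$2"; do
        printf '%s bigrams=%s\n' "$lm" \
            "$(sed -n 's/^ngram 2=//p' "$lm" | head -n 1)"
    done
    # cmp -s prints nothing and only sets the exit status.
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
    fi
}
```

It would be called as: compare_lms text-group-001-1.lm text-group-001-2.lm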

Both texts contain n-grams that occur once and others that occur twice or
more.
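
A rough way to check this on a count file is sketched below. The helper name
count_bigram_classes is hypothetical, and the sketch assumes each line of the
count file holds the words of one n-gram followed by its count, separated by
whitespace, so that a bigram line has exactly three fields.

```shell
# Hypothetical helper (not part of SRILM): tally bigram types by count class.
# Assumes each line is "w1 w2 count", separated by spaces or tabs.
count_bigram_classes() {
    awk 'NF == 3 {
             if ($3 == 1) once++          # bigram types seen exactly once
             else if ($3 >= 2) more++     # bigram types seen twice or more
         }
         END { printf "singletons=%d repeated=%d\n", once, more }' "$1"
}
```

It would be called as: count_bigram_classes text-group-001.count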

What are we doing wrong?

We appreciate your help!

Claude