Unexpected "ngram-count -recompute" result

Fri Dec 20 02:03:18 PST 2002

In message <20021217143852.A7495 at luistervink.cs.utwente.nl> you wrote:
> Hello,
> 
> We just noticed the following when using the -recompute flag of ngram-count.
> We're just trying to generate uni- and bigram counts from trigram counts, but
> some are missing:
> 
> [1 - directly summing uni-, bi- and trigram counts of a simple text file]
> 
> melis at luistervink:/local/export/melis/lm> cat t
> <s> this is a test </s>
> 
> melis at luistervink:/local/export/melis/lm> ngram-count -text t -sort
> </s>    1
> <s>     1
> <s> this        1
> <s> this is     1
> a       1
> a test  1
> a test </s>     1
> is      1
> is a    1
> is a test       1
> test    1
> test </s>       1
> this    1
> this is 1
> this is a       1
> 
> [2 - only summing trigram counts]
> 
> melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort
> <s> this is     1
> a test </s>     1
> is a test       1
> this is a       1
> 
> [3 - using the previous trigram counts to generate uni- and bigram counts]
> 
> melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort | ngram-count -recompute -sort -read -
> <s>     1
> <s> this        1
> <s> this is     1
> a       1
> a test  1
> a test </s>     1
> is      1
> is a    1
> is a test       1
> this    1
> this is 1
> this is a       1
> 
> We expected the output of 1 and 3 to be the same, but notice the missing
> unigrams "</s>" and "test". Also, the bigram "test </s>" is missing.
> Is this a bug, or is there something we're missing here? It seems to be
> related to the end-of-sentence symbol.
> This is with SRILM 1.3.2, BTW.
> 
> Regards,
> Paul
> 

It's a bug of sorts, or a feature depending on your point of view.

-recompute rebuilds each lower-order count by summing the trigrams that
begin with that N-gram.  Because </s> is not followed by anything, no
trigram begins with "test", "test </s>" or "</s>", so discarding the
unigrams and bigrams will in fact discard information that is not
contained in the trigrams.  I'm not sure why you are doing what you
describe, but a quick solution would be to introduce "dummy" N-grams that
complete the N-grams ending in </s> to the full length of the counts
you want to keep.  The little script below does that.
If you call it "complete-eos-ngrams" then

ngram-count -text t -write - | \
complete-eos-ngrams | \
ngram-count -read - -write-order 3 | \
ngram-count -recompute -sort -read - 

will produce the output you expect.
Alternatively, you could tack dummy words onto the end of your input
sentences.  In either case you have to delete the dummy N-grams from the
final output.
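
For that last step, a filter along the following lines should do.  It is
only a minimal sketch, not something shipped with SRILM: the function name
strip_dummy_ngrams is made up, and it assumes the placeholder word is
DUMMY (as in complete-eos-ngrams) and never occurs in the real text.

```sh
#!/bin/sh
# strip_dummy_ngrams: drop every count line whose N-gram contains the
# placeholder word.  Hypothetical helper; the placeholder defaults to
# "DUMMY" and can be overridden with the first argument.
strip_dummy_ngrams() {
	awk -v dummy="${1:-DUMMY}" '
	{
		# The last field is the count; every field before it is a word.
		for (i = 1; i < NF; i++)
			if ($i == dummy)
				next
		print
	}'
}

# Possible use (requires SRILM on the PATH):
#   ... | ngram-count -recompute -sort -read - | strip_dummy_ngrams
```

Lines are printed unmodified, so the original tab before the count is kept.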

--Andreas

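#!/usr/local/bin/gawk -f
#
# complete-eos-ngrams:
#	pad N-gram counts ending in </s> with DUMMY words up to the full order

BEGIN {
	# length of the N-grams that will be kept (trigrams)
	order = 3;
}

# pass every input line through unchanged
{
	print;
}

# the last field is the count, so $(NF - 1) is the final word of the N-gram;
# the NF test skips blank or one-field lines
NF >= 2 && $(NF - 1) == "</s>" {
	count = $NF;

	# overwrite the count field with DUMMY, then re-append the count;
	# each pass makes the N-gram one word longer, up to order words
	for (i = NF; i <= order; i ++) {
		$i = "DUMMY";
		print $0, count;
	}
}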
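
To see why the padding helps, the recomputation can be imitated without
SRILM.  The sketch below assumes that lower-order counts are rebuilt as
prefix sums over the trigrams, which is consistent with output [3] in the
quoted message; prefix_sums is a hypothetical helper, not SRILM code.

```sh
#!/bin/sh
# prefix_sums: imitate lower-order recomputation as a prefix sum.
# Reads "w1 w2 w3 count" lines and prints the summed count of every
# prefix (unigram, bigram, and the trigram itself), sorted.
# Illustration only; the function name is hypothetical.
prefix_sums() {
	awk '
	{
		prefix = ""
		for (i = 1; i < NF; i++) {
			prefix = (i == 1) ? $i : prefix " " $i
			sum[prefix] += $NF
		}
	}
	END {
		for (p in sum)
			print p "\t" sum[p]
	}' | LC_ALL=C sort
}

# The four trigrams of "<s> this is a test </s>": no prefix is "test",
# "test </s>" or "</s>", so those counts cannot come back.
printf '%s\n' '<s> this is 1' 'this is a 1' 'is a test 1' 'a test </s> 1' |
	prefix_sums

# With the padded trigrams added, the three missing counts reappear
# (along with DUMMY N-grams that still have to be removed).
printf '%s\n' '<s> this is 1' 'this is a 1' 'is a test 1' 'a test </s> 1' \
	'test </s> DUMMY 1' '</s> DUMMY DUMMY 1' |
	prefix_sums
```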