Unexpected "ngram-count -recompute" result

Paul Melis melis at cs.utwente.nl
Tue Dec 17 05:38:52 PST 2002


Hello,

We just noticed the following when using the -recompute flag of ngram-count. We are trying to generate unigram and bigram counts from trigram counts, but some of them are missing:

[1 - directly summing uni-, bi- and trigram counts of a simple text file]

melis at luistervink:/local/export/melis/lm> cat t
<s> this is a test </s>

melis at luistervink:/local/export/melis/lm> ngram-count -text t -sort
</s>    1
<s>     1
<s> this        1
<s> this is     1
a       1
a test  1
a test </s>     1
is      1
is a    1
is a test       1
test    1
test </s>       1
this    1
this is 1
this is a       1

[2 - only summing trigram counts]

melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort
<s> this is     1
a test </s>     1
is a test       1
this is a       1

[3 - using the previous trigram counts to generate uni- and bigram counts]

melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort | ngram-count -recompute -sort -read -
<s>     1
<s> this        1
<s> this is     1
a       1
a test  1
a test </s>     1
is      1
is a    1
is a test       1
this    1
this is 1
this is a       1

We expected the outputs of 1 and 3 to be identical, but in 3 the unigrams "</s>" and "test" and the bigram "test </s>" are missing.
Is this a bug, or is there something we're missing here? It seems to be related to the end of sentence symbol. 
This is with SRILM 1.3.2, BTW.
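
One possible explanation, which is only a guess since we have not checked the SRILM source: if -recompute rebuilds each lower-order count by summing over the higher-order N-grams that start with it, then any unigram or bigram that never occurs at the start of a trigram would get no count at all. The Python sketch below is not SRILM code; recompute_from_prefixes is a hypothetical helper that implements this assumed prefix summation on the trigram counts from step 2.

```python
from collections import Counter


def recompute_from_prefixes(trigram_counts):
    """Assumed behaviour, NOT taken from SRILM: add each trigram's count
    to every one of its prefixes (unigram, bigram, and the trigram itself)."""
    counts = Counter()
    for ngram, count in trigram_counts.items():
        words = tuple(ngram.split())
        for n in range(1, len(words) + 1):
            counts[words[:n]] += count
    return counts


# Trigram counts exactly as printed in step 2.
trigrams = {
    "<s> this is": 1,
    "a test </s>": 1,
    "is a test": 1,
    "this is a": 1,
}

recomputed = recompute_from_prefixes(trigrams)
for ngram in sorted(recomputed):
    print(" ".join(ngram), recomputed[ngram], sep="\t")
```

Under this assumption "test", "</s>" and "test </s>" never receive a count, because none of them is a prefix of any trigram in the input. That would match what we see in 3, but it does not tell us whether the behaviour is intended.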

Regards,
Paul

-- 
melis at cs.utwente.nl


