Unexpected "ngram-count -recompute" result
Paul Melis
melis at cs.utwente.nl
Tue Dec 17 05:38:52 PST 2002
Hello,
We just noticed the following when using the -recompute flag of ngram-count. We're trying to generate uni- and bigram counts from trigram counts, but some are missing:
[1 - directly summing uni-, bi- and trigram counts of a simple text file]
melis at luistervink:/local/export/melis/lm> cat t
<s> this is a test </s>
melis at luistervink:/local/export/melis/lm> ngram-count -text t -sort
</s> 1
<s> 1
<s> this 1
<s> this is 1
a 1
a test 1
a test </s> 1
is 1
is a 1
is a test 1
test 1
test </s> 1
this 1
this is 1
this is a 1
[2 - only summing trigram counts]
melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort
<s> this is 1
a test </s> 1
is a test 1
this is a 1
[3 - using the previous trigram counts to generate uni- and bigram counts]
melis at luistervink:/local/export/melis/lm> ngram-count -text t -write-order 3 -sort | ngram-count -recompute -sort -read -
<s> 1
<s> this 1
<s> this is 1
a 1
a test 1
a test </s> 1
is 1
is a 1
is a test 1
this 1
this is 1
this is a 1
We expected the outputs of [1] and [3] to be the same, but [3] is missing the unigrams "</s>" and "test", as well as the bigram "test </s>".
Is this a bug, or is there something we're missing here? It seems to be related to the end of sentence symbol.
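The gap can be reproduced outside SRILM. Below is a minimal Python sketch, assuming that -recompute rebuilds each lower-order count by summing the trigram counts that start with that n-gram. That is a guess about the behaviour, not something checked against the SRILM source. The function names are hypothetical and only serve the illustration.

```python
from collections import Counter


def direct_counts(tokens, max_order=3):
    """Count all n-grams of order 1..max_order directly from the token list."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def recompute_from_highest(highest):
    """Derive lower-order counts by summing highest-order counts per prefix.

    ASSUMPTION: this mimics what -recompute appears to do; it is not
    taken from the SRILM implementation.
    """
    counts = Counter(highest)
    for ngram, c in highest.items():
        for k in range(1, len(ngram)):
            counts[ngram[:k]] += c
    return counts


tokens = "<s> this is a test </s>".split()

full = direct_counts(tokens)                                        # like [1]
trigrams = Counter({g: c for g, c in full.items() if len(g) == 3})  # like [2]
derived = recompute_from_highest(trigrams)                          # like [3]

# N-grams present in the direct counts but absent after recomputation.
missing = sorted(" ".join(g) for g in full if g not in derived)
print(missing)  # ['</s>', 'test', 'test </s>']
```

Under that assumption, "test", "</s>" and "test </s>" never occur as the prefix of any trigram in this sentence, so they receive no count at all, which matches the output of [3].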
This is with SRILM 1.3.2, BTW.
Regards,
Paul
--
melis at cs.utwente.nl