[SRILM User List] C(<s>) is always zero?
Andreas Stolcke
stolcke at icsi.berkeley.edu
Wed Feb 15 07:31:44 PST 2012
On 2/15/2012 4:00 AM, James Kirby wrote:
> Hello,
>
> is there a reason why the unigram count of the auto-prepended sentence
> start tag <s> is always zero? As can be seen from the output below,
> the log probabilities are calculated counting the sentence end tags
> </s> but not the start tags. Or have I just missed something horribly
> obvious?
You are confusing a token's frequency in the text with the probability
in the model.
Because <s> only occurs as part of an ngram's history, but never as the
token being predicted, its probability is 0. If P(<s>) were > 0, then
(via backoff) you would also have P(<s> | ...) > 0. That probability mass
would be spent on a token that can never follow any history, so the
probabilities of the words that can actually occur would sum to < 1.
If you want the unigram probability of a sentence boundary, use the </s>
tag.
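
To make the distinction between count and probability concrete, here is a
minimal sketch using plain maximum-likelihood unigram estimates. It is not
SRILM's code and ignores smoothing and backoff; the function name
unigram_stats and the two toy sentences are assumptions made up for
illustration.

```python
from collections import Counter


def unigram_stats(sentences):
    """Return (raw token counts, ML unigram probabilities).

    Hypothetical helper for illustration only; not part of SRILM.
    """
    counts = Counter()
    for sentence in sentences:
        # Boundary tags are added automatically, as in the original question.
        counts.update(["<s>"] + sentence.split() + ["</s>"])

    # Only tokens that can be *predicted* are events in the model.
    # <s> appears in histories but is never predicted, so it is excluded.
    predicted = {w: c for w, c in counts.items() if w != "<s>"}
    total = sum(predicted.values())

    probs = {w: c / total for w, c in predicted.items()}
    probs["<s>"] = 0.0  # present in the vocabulary, but with zero probability
    return counts, probs


counts, probs = unigram_stats(["a b", "a"])

# Mass left for real events if <s> were naively given its relative frequency.
naive_total = sum(counts.values())
naive_mass = sum(c for w, c in counts.items() if w != "<s>") / naive_total

print("count of <s>:", counts["<s>"])
print("P(<s>):", probs["<s>"])
print("sum of probabilities:", sum(probs.values()))
print("mass left for real events if <s> were counted:", naive_mass)
```

In this sketch <s> has a nonzero count in the text but probability 0 in the
model, and the remaining probabilities sum to 1. Giving <s> its relative
frequency instead would leave less than 1 for the tokens that can actually
be predicted.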
Andreas