[SRILM User List] C(<s>) is always zero?

Andreas Stolcke stolcke at icsi.berkeley.edu
Wed Feb 15 07:31:44 PST 2012


On 2/15/2012 4:00 AM, James Kirby wrote:
> Hello,
>
> is there a reason why the unigram count of the auto-prepended sentence 
> start tag <s> is always zero? As can be seen from the output below, 
> the log probabilities are calculated counting the sentence end tags 
> </s> but not the start tags. Or have I just missed something horribly 
> obvious?

You are confusing a token's frequency in the text with its probability 
in the model. Because <s> only ever occurs as part of an ngram's 
history, and never as the token being predicted, its probability is 0. 
If P(<s>) were > 0, then (via backoff) you would also have 
P(<s> | ...) > 0, and the probabilities of all allowed words would sum 
to less than 1.
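
The normalization argument can be checked on a toy corpus. The Python 
sketch below is an illustration only and is not how SRILM estimates its 
models: it uses plain maximum-likelihood unigram estimates with no 
discounting or backoff. The function name unigram_mle and the flag 
count_start_tag are hypothetical, introduced just for this example.

```python
from collections import Counter
from fractions import Fraction


def unigram_mle(sentences, count_start_tag=False):
    """Toy maximum-likelihood unigram estimates (assumption: no smoothing).

    By default only predicted tokens are counted: the words of each
    sentence plus the end tag </s>. With count_start_tag=True the start
    tag <s> is (wrongly) counted as a predicted token as well.
    """
    counts = Counter()
    for sentence in sentences:
        tokens = sentence.split() + ["</s>"]
        if count_start_tag:
            tokens = ["<s>"] + tokens
        counts.update(tokens)
    total = sum(counts.values())
    # Exact fractions avoid floating-point noise when checking the sums.
    return {tok: Fraction(c, total) for tok, c in counts.items()}


sentences = ["a b", "a"]

# <s> is never a predicted token, so it receives no probability mass
# and the remaining tokens sum to exactly 1.
model = unigram_mle(sentences)
p_start = model.get("<s>", Fraction(0))
total_mass = sum(model.values())

# If <s> were given mass, the tokens that can actually be predicted
# (everything except <s>) would sum to less than 1.
leaky = unigram_mle(sentences, count_start_tag=True)
allowed_mass = sum(p for tok, p in leaky.items() if tok != "<s>")

print("P(<s>)  =", p_start)
print("P(</s>) =", model["</s>"])
print("sum over predicted tokens       =", total_mass)
print("sum over allowed words if leaky =", allowed_mass)
```

In this toy setting the end tag </s> carries the sentence-boundary 
probability, while any mass given to <s> is simply lost to the words 
that can actually be predicted.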

If you want the unigram probability of a sentence boundary, use the </s> 
tag.

Andreas


