SRILM beginning and end tokens?

Andreas Stolcke stolcke at speech.sri.com
Tue Mar 20 21:27:00 PDT 2007


In message <20070320233327.E8AD478B51 at epoch.cs> you wrote:
> Dear Andreas,
> 
> I am very grateful to benefit from your work by using this toolkit.  It's
> great!  
> 
> I noticed it adds <s> and </s> tokens if they aren't there.  However, I'm
> modelling with trigrams, and it seems to add only one begin/end pair per
> sentence.  Is there an option I missed, or do I need to insert them myself?

For </s>, there is never a reason to add more than one such token:
the last ngram probability that goes into the sentence probability is

	p( </s> | ... ) 

For <s>, you also need no more than one token, since the backoff
mechanism establishes that

	p( w1 | ... <s> ) = p(w1 | <s>)

I know that some other implementations add additional higher-order ngrams
by filling in multiple copies of <s>, but I believe that is not well motivated.
It could also lead to unnatural count-of-count statistics for Kneser-Ney (KN)
and Good-Turing (GT) smoothing.
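To illustrate the point above, here is a minimal sketch of scoring a sentence
with a backoff trigram model, padding with exactly one <s>/</s> pair. The
probability tables and function names are made up for illustration, not
SRILM's internal API; a real model would also multiply in backoff weights
when a higher-order ngram is missing.

```python
import math

# Toy, hypothetical probability tables: ngram -> conditional probability.
trigrams = {("<s>", "the", "cat"): 0.5, ("the", "cat", "</s>"): 0.4}
bigrams  = {("<s>", "the"): 0.6, ("the", "cat"): 0.3, ("cat", "</s>"): 0.2}
unigrams = {"the": 0.1, "cat": 0.05, "</s>": 0.1}

def logp(word, context):
    # Back off trigram -> bigram -> unigram.  A real model would also
    # add the log backoff weight of each dropped context.
    if len(context) == 2 and (*context, word) in trigrams:
        return math.log(trigrams[(*context, word)])
    if context and (context[-1], word) in bigrams:
        return math.log(bigrams[(context[-1], word)])
    return math.log(unigrams[word])

def sentence_logprob(words):
    padded = ["<s>"] + words + ["</s>"]   # exactly one begin/end pair
    total = 0.0
    for i in range(1, len(padded)):
        context = tuple(padded[max(0, i - 2):i])
        total += logp(padded[i], context)
    return total

# The first word is scored as p(the | <s>) -- a bigram, via backoff --
# and the last ngram is p(</s> | the cat), so no extra padding is needed.
print(sentence_logprob(["the", "cat"]))  # log(0.6 * 0.5 * 0.4)
```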

Andreas 

> 
> Thank you!
> -Amber
> 
> 
> \   L. Amber Wilcox-O'Hearn * http://www.cs.toronto.edu/~amber/   /
> -\  Graduate student * Computational Linguistics Research Group  /-
> --\   Department of Computer Science * University of Toronto    /--




More information about the SRILM-User mailing list