SRILM beginning and end tokens?
stolcke at speech.sri.com
Tue Mar 20 21:27:00 PDT 2007
In message <20070320233327.E8AD478B51 at epoch.cs> you wrote:
> Dear Andreas,
> I am very grateful to benefit from your work by using this toolkit. It's
> I noticed it adds <s> and </s> tokens if they aren't there. However, I'm
> modelling with trigrams, and it seems to add only one begin/end pair per
> sentence. Is there an option I missed, or do I need to insert them myself?
For </s>, there is never a reason to add more than one such token, since
the last ngram probability that goes into the sentence probability is
    p( </s> | ... )
For <s>, you also need no more than one token, since the backoff will give you
    p( w1 | ... <s> ) = p( w1 | <s> )
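
To make the two points above concrete, here is a minimal Python sketch (not SRILM code; the function name ngram_events is hypothetical) that lists the conditional probabilities multiplied together for one sentence when a single <s> and a single </s> are used. The first word is conditioned on <s> alone, and the final term predicts </s>.

```python
def ngram_events(words, order=3):
    """Return the (context, word) pairs scored for one sentence.

    Assumes exactly one <s> and one </s> per sentence, as described above.
    """
    toks = ["<s>"] + list(words) + ["</s>"]
    events = []
    # Start at index 1: <s> itself is never predicted, only conditioned on.
    for i in range(1, len(toks)):
        # Use at most (order - 1) preceding tokens as context.
        context = tuple(toks[max(0, i - (order - 1)):i])
        events.append((context, toks[i]))
    return events


for context, word in ngram_events(["w1", "w2", "w3"]):
    print(f"p( {word} | {' '.join(context)} )")
```

With order=3, the first term is the bigram p( w1 | <s> ) and the last is p( </s> | w2 w3 ), so no extra padding tokens are needed on either end.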
I know that some other implementations add additional higher-order ngrams
by filling in multiple copies of <s>, but I believe that is not well motivated.
It could also lead to unnatural count-of-count statistics for KN and GT
discounting.
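
As a rough illustration of the count-of-count concern, the following toy Python sketch (again not SRILM code; the corpus and the helper names trigram_counts and count_of_counts are made up for the example) compares trigram counts with one versus two copies of <s>. Padding with a second <s> adds a trigram "<s> <s> w1" for every sentence, whose count simply repeats the count of the bigram "<s> w1", so the trigram count-of-counts picks up extra entries.

```python
from collections import Counter


def trigram_counts(sentences, n_start):
    """Count trigrams after prepending n_start copies of <s> and one </s>."""
    counts = Counter()
    for words in sentences:
        toks = ["<s>"] * n_start + list(words) + ["</s>"]
        for i in range(len(toks) - 2):
            counts[tuple(toks[i:i + 3])] += 1
    return counts


def count_of_counts(counts):
    """Map each count value r to the number of distinct trigrams seen r times."""
    return Counter(counts.values())


# Toy corpus, purely illustrative.
corpus = [["the", "cat"], ["the", "dog"], ["the", "cat"]]

single = trigram_counts(corpus, n_start=1)
double = trigram_counts(corpus, n_start=2)

print("one <s>: ", sorted(count_of_counts(single).items()))
print("two <s>: ", sorted(count_of_counts(double).items()))
```

In this toy corpus the only difference is the added trigram "<s> <s> the", seen once per sentence; how much such entries distort discounting on real data is not something this sketch measures.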
> Thank you!
> \ L. Amber Wilcox-O'Hearn * http://www.cs.toronto.edu/~amber/ /
> -\ Graduate student * Computational Linguistics Research Group /-
> --\ Department of Computer Science * University of Toronto /--