[SRILM User List] Difference between segment and hidden-ngram
stolcke at icsi.berkeley.edu
Wed Nov 23 08:34:23 PST 2016
On 11/22/2016 11:26 PM, Eeva Nikkari wrote:
> I'm getting different results from using the segment function or the
> hidden-ngram function with a hidden vocabulary of "</s>" in segmenting
> text into sentences.
> In both cases I used the same wb-discounted 3-gram model, but I get
> different segmentations depending on whether I use hidden-ngram or
> segment. (Similarly with other models I've tried.)
> It seems segment assigns more sentence boundaries (and performs better).
> What's the difference between using segment and hidden-ngram with
> hidden-vocab "</s>" ?
> I would like to use hidden-ngram since I want to test out higher order
> models as well, but it's strange that segment works better.
> Thank you,
segment and hidden-ngram are similar but they use the LM in slightly
different ways.
segment uses the <s> and </s> tokens contained in a standard LM to model
the hidden sentence boundaries. So the probability of a latent boundary
between a and b would be evaluated as the product of
p(</s> | a)
p(b | </s>)
if only bigrams were involved (for higher-order ngrams you'd have to
include more tokens before and after).
hidden-ngram uses a separate, user-defined set of tokens to signify
hidden events. To perform sentence segmentation you would mark all
boundaries with a tag like <B> and then hidden-ngram evaluates the
likelihood of a boundary as
p(<B> | a)
p(b | <B>)
(again, using only bigrams for simplicity).
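To make the bigram case concrete, here is a minimal Python sketch of the boundary score described above. It is not SRILM code. The probability table is made up for illustration, the function names are hypothetical, and comparing the boundary score against p(b | a) is an assumption of this sketch about how one would decide locally; the actual tools evaluate whole sequences rather than single positions.

```python
# Toy bigram probabilities keyed as (history, word).
# The numbers are invented for illustration, not taken from a real LM.
BIGRAM = {
    ("a", "<B>"): 0.2,   # p(<B> | a)
    ("<B>", "b"): 0.1,   # p(b | <B>)
    ("a", "b"): 0.05,    # p(b | a)
}

def boundary_score(prev, nxt, tag, probs):
    """Score of a hidden boundary between prev and nxt:
    p(tag | prev) * p(nxt | tag)."""
    return probs[(prev, tag)] * probs[(tag, nxt)]

def no_boundary_score(prev, nxt, probs):
    """Score of the same position without a boundary: p(nxt | prev)."""
    return probs[(prev, nxt)]

with_b = boundary_score("a", "b", "<B>", BIGRAM)
without_b = no_boundary_score("a", "b", BIGRAM)

# Local decision under the assumptions stated above.
print("boundary" if with_b > without_b else "no boundary")
```

The same function covers the segment case if the table holds p(</s> | a) and a probability for b following the sentence start, which is where the two-token encoding discussed next comes in.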
You cannot just use the <s> and </s> tokens with hidden-ngram because of
the special way that the end/start of a sentence are encoded by two
different tokens. hidden-ngram assumes you have a single token type
for the hidden boundary. This allows ngrams of the form "a <B> b",
whereas you will never find an ngram "a </s> <s> b" in a standard LM.
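As a rough illustration of why a single boundary token matters, the following Python sketch joins sentences into one token stream with <B> between them and lists the resulting trigrams. The helper names and the example sentences are hypothetical, and this only shows the shape of the training data, not how SRILM reads it.

```python
def tag_boundaries(sentences, tag="<B>"):
    """Join whitespace-tokenized sentences into one token stream,
    inserting a single boundary tag between consecutive sentences."""
    tokens = []
    for i, sent in enumerate(sentences):
        if i > 0:
            tokens.append(tag)
        tokens.extend(sent.split())
    return tokens

def ngrams(tokens, n):
    """All contiguous n-grams of the token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

stream = tag_boundaries(["x a", "b y"])
trigrams = ngrams(stream, 3)

for tg in trigrams:
    print(" ".join(tg))
```

With one sentence per line and the usual <s>/</s> padding, no ngram would span the two sentences, so the context "a <B> b" would never be observed.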
You could try to manipulate the LM by replacing <s> and </s> with the
same tag, and use that with hidden-ngram.
But then you've lost the ability to model the start and end of the input
string in the normal way.
Another difference is that segment only handles trigrams, not
higher-order models. Basically, the segment program was an early hack
to deal with automatic segmentation that got rationalized later in the
hidden-ngram tool. I don't recommend using segment.