[SRILM User List] Difference between segment and hidden-ngram

Andreas Stolcke stolcke at icsi.berkeley.edu
Wed Nov 23 08:34:23 PST 2016

On 11/22/2016 11:26 PM, Eeva Nikkari wrote:
> Hello,
> I'm getting different results from using the segment function or the 
> hidden-ngram function with a hidden vocabulary of "</s>" in segmenting 
> text into sentences.
> In both cases I used the same wb-discounted 3-gram model, but I get 
> different segmentations depending on whether I use hidden-ngram or 
> segment. (similarly with other models I've tried)
> It seems segment assigns more sentence boundaries (and performs better).
> What's the difference between using segment and hidden-ngram with 
> hidden-vocab "</s>" ?
> I would like to use hidden-ngram since I want to test out higher order 
> models as well, but it's strange that segment works better.
> Thank you,
> Eeva

segment and hidden-ngram are similar but they use the LM in slightly 
different ways.

segment uses the <s> and </s> tokens contained in a standard LM to model 
the hidden sentence boundaries.  So the probability of a latent boundary 
between a and b  would be evaluated as the product of

                 p(</s> | a)
                 p(b | </s>)

if only bigrams were involved (for higher-order ngrams you'd have to 
include more tokens before and after).

hidden-ngram uses a separate, user-defined set of tokens to signify 
hidden events.   To perform sentence segmentation you would mark all 
boundaries with a tag like <B> and then hidden-ngram evaluates the 
likelihood of a boundary as

             p(<B> | a)
             p(b | <B>)

(again, using only bigrams for simplicity).
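The two bigram factorizations above can be illustrated with a toy probability
table (the numbers below are made up for illustration and are not the output
of any real LM; the function names are mine, not SRILM's):

```python
import math

# Toy bigram probabilities p(w2 | w1) -- illustrative values only.
bigram = {
    ("a", "</s>"): 0.30,   # segment-style: boundary via the end-of-sentence token
    ("<s>", "b"): 0.20,    # ...followed by b starting a new sentence
    ("a", "<B>"): 0.25,    # hidden-ngram-style: boundary via a single hidden tag
    ("<B>", "b"): 0.15,
    ("a", "b"): 0.01,      # no boundary between a and b
}

def segment_boundary_logprob(w1, w2):
    """Score segment assigns to a boundary between w1 and w2:
    p(</s> | w1) * p(w2 | <s>)."""
    return math.log(bigram[(w1, "</s>")]) + math.log(bigram[("<s>", w2)])

def hidden_boundary_logprob(w1, w2, tag="<B>"):
    """Score hidden-ngram assigns to a hidden boundary event between
    w1 and w2: p(<B> | w1) * p(w2 | <B>)."""
    return math.log(bigram[(w1, tag)]) + math.log(bigram[(tag, w2)])

def no_boundary_logprob(w1, w2):
    """Score for no boundary: the plain bigram p(w2 | w1)."""
    return math.log(bigram[(w1, w2)])
```

With these numbers both tools would prefer a boundary between a and b, but
they arrive at different scores because they condition on different tokens,
which is one way the two programs can disagree on the same input.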

You cannot just use the <s> and </s> tokens with hidden-ngram because of 
the special way the end and start of a sentence are encoded by two 
different tokens.   hidden-ngram assumes you have a single token type 
for the hidden boundary.  This allows ngrams of the form "a <B> b",  
whereas you will never find an ngram "a </s> <s> b"  in a standard LM.

You could try to manipulate the LM by replacing <s> and </s> with the 
same tag, and use that with hidden-ngram.
But then you've lost the ability to model the start and end of the input 
string in the normal way.
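A minimal sketch of that manipulation on an ARPA-format LM file (the helper
below is hypothetical, not part of SRILM). Note that the rewritten file will
generally contain duplicate n-gram entries and probabilities that no longer
normalize, so it would need cleanup (e.g. renormalizing with ngram -renorm)
before serious use:

```python
def merge_boundary_tokens(src_path, dst_path, tag="<B>"):
    """Rewrite an ARPA-format LM so that <s> and </s> both become a single
    boundary tag.  Rough sketch only: the result can contain duplicate
    n-grams (one derived from "a </s>", one from "<s> b", etc.) and its
    probabilities no longer sum to one, so a real conversion would have to
    merge duplicates and renormalize afterwards."""
    with open(src_path) as src, open(dst_path, "w") as dst:
        for line in src:
            # ARPA lines are whitespace-separated: logprob w1 ... wn [backoff].
            # Header lines (\data\, ngram counts, \1-grams:) pass through
            # unchanged since they contain neither token.
            fields = line.split()
            fields = [tag if f in ("<s>", "</s>") else f for f in fields]
            dst.write(" ".join(fields) + "\n")
```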

Another difference is that segment only handles trigrams, not 
higher-order models.  Basically, the segment program was an early hack 
for automatic segmentation that got rationalized later in the 
hidden-ngram tool.  I don't recommend using it.

