[SRILM User List] ngram-count's ARPA N-gram LM extensions beyond "\end\" marker
Andreas Stolcke
stolcke at icsi.berkeley.edu
Mon Jun 24 19:44:07 PDT 2013
On 6/24/2013 10:41 AM, Sander Maijers wrote:
>
> Based on the equations you described to me and the code, I do not see
> a fundamental difference between the skip N-gram model and Jelinek-Mercer
> smoothing / deleted interpolation (Chen & Goodman, 1999, eqn. 4, p. 364).
> In the skip LM the skip probabilities take the place of the lambda
> weights in the Jelinek-Mercer equation, and are estimated in the perhaps
> special way you explained. Is there something I am missing?
Jelinek-Mercer is a way to smooth N-gram probabilities by combining
estimates based on different suffixes of the history, e.g.
p(w|w1 w2 w3) = l1 * p'(w|w1 w2 w3) + l2 * p'(w|w1 w2) + l3 * p'(w|w1)
+ l4 * p'(w) + l5 / N (N = size of vocabulary)
where p'(.) is a maximum-likelihood estimate.
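For concreteness, here is a minimal Python sketch of that interpolation (this is not SRILM code; the function names, toy counts, and weights are illustrative assumptions, and histories are shortened by dropping the most distant words first):

```python
# Minimal sketch of Jelinek-Mercer (deleted) interpolation: a weighted sum of
# maximum-likelihood estimates over successively shorter suffixes of the
# history, plus a uniform 1/N floor. Toy counts below are made up.
from collections import Counter

def ml_estimate(counts_by_hist, hist, word):
    """Maximum-likelihood p'(word | hist) from raw counts."""
    c = counts_by_hist.get(hist)
    return c[word] / sum(c.values()) if c else 0.0

def jelinek_mercer(counts_by_hist, hist, word, lambdas, vocab_size):
    """lambdas holds one weight per history length (full down to empty),
    plus a final weight for the uniform distribution 1/vocab_size."""
    p = 0.0
    for i, lam in enumerate(lambdas[:-1]):
        p += lam * ml_estimate(counts_by_hist, hist[i:], word)  # shorter suffix
    return p + lambdas[-1] / vocab_size

# Toy counts for history (w1, w2, w3) and its shorter suffixes:
counts = {
    ("w1", "w2", "w3"): Counter({"w": 1, "x": 1}),  # ML estimate 0.5
    ("w2", "w3"):       Counter({"w": 1, "x": 3}),  # ML estimate 0.25
    ("w3",):            Counter({"w": 1}),          # ML estimate 1.0
    ():                 Counter({"w": 1, "x": 4}),  # ML estimate 0.2
}
p = jelinek_mercer(counts, ("w1", "w2", "w3"), "w",
                   (0.4, 0.3, 0.2, 0.05, 0.05), 10)
# 0.4*0.5 + 0.3*0.25 + 0.2*1.0 + 0.05*0.2 + 0.05/10 = 0.49
```

In a real system the lambdas would be estimated on held-out data (e.g. by EM), not fixed by hand as here.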
In skip-Ngram modeling, by contrast, you combine different histories
that differ by skipping a word, e.g.
p(w | w1 w2 w3 w4) = l1 * p'(w | w1 w2 w3) + l2 * p'(w | w2 w3 w4)
where p'(.) is now a smoothed, rather than maximum-likelihood, estimate.
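A corresponding sketch of the skip interpolation (again illustrative, not SRILM's implementation; smoothed_p is an assumed stand-in for any smoothed N-gram estimator, and here each lambda weighs the history with one position left out):

```python
# Minimal sketch of skip-N-gram interpolation: combine smoothed estimates
# from histories that differ by which word of the context is skipped.
# smoothed_p and the toy values below are illustrative assumptions.

def skip_ngram(smoothed_p, hist, word, lambdas):
    """One lambda per position; each term skips one word of the history."""
    p = 0.0
    for skip, lam in enumerate(lambdas):
        reduced = hist[:skip] + hist[skip + 1:]  # history minus one word
        p += lam * smoothed_p(reduced, word)
    return p

# Toy smoothed estimator: the value depends only on whether w1 survives.
def smoothed_p(hist, word):
    return 0.2 if "w1" in hist else 0.8

p = skip_ngram(smoothed_p, ("w1", "w2"), "w", (0.5, 0.5))
# 0.5 * p'(w | w2) + 0.5 * p'(w | w1) = 0.5*0.8 + 0.5*0.2 = 0.5
```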
The only similarity is that both use linear interpolation of underlying
probability estimators to arrive at a better estimator. That does not say
much: linear interpolation is extremely widely used in all sorts of
probability models.
Andreas