[SRILM User List] finding likely substitutes quickly

Deniz Yuret dyuret at ku.edu.tr
Fri Apr 23 05:28:23 PDT 2010


Hi,

Recently I have been working with several disambiguation systems that
all rely on finding likely substitutes for a given target word in a
given context using a language model.  I am wondering whether somebody
has worked out a smarter algorithm to do this quickly, even if only
approximately.

In order to take into account the right context as well as the left,
here is what I do:  Say we have a 3-gram model and the word sequence
is ABCDE with C being the target word.  I take a large subset of words
from the dictionary as potential substitutes for C (sometimes
constrained to be in the same part of speech etc. but still on the
order of thousands of candidates).  Then for each candidate X, I
calculate the probability of ABXDE:

P(ABXDE) = P(A) P(B|A) P(X|AB) P(D|ABX) P(E|ABXD)
  => P(X|AB) P(D|ABX) P(E|ABXD)   ;; the terms without X can be dropped
  => P(X|AB) P(D|BX) P(E|XD)          ;; the 3-gram assumption
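
In code, the per-candidate scoring looks roughly like the sketch below
(Python; logprob(word, context) is just a stand-in for whatever smoothed
3-gram estimator is used, not any particular SRILM call):

def score_candidate(x, a, b, d, e, logprob):
    # Score substitute x for the target slot in A B _ D E.
    # logprob(w, context) should return log P(w | context) under the
    # 3-gram model, so the sum below is
    #   log P(x|AB) + log P(D|Bx) + log P(E|xD).
    return (logprob(x, (a, b))
            + logprob(d, (b, x))
            + logprob(e, (x, d)))

def rank_substitutes(candidates, a, b, d, e, logprob, k=100):
    # Brute force: score every candidate, keep the k best.
    scored = sorted(((score_candidate(x, a, b, d, e, logprob), x)
                     for x in candidates), reverse=True)
    return scored[:k]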

I need the counts of at least 9 patterns to compute this probability,
usually more depending on the smoothing method.  I extract the
patterns, look up their counts in the Google n-gram data, and compute
the smoothed probabilities, etc.  For good disambiguation it is
important that I get both the right substitutes and the right
probabilities for them.
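
For one candidate X, the patterns I end up counting are roughly these
(one plausible accounting under a simple back-off scheme; the exact set
depends on the smoothing method):

def patterns_for_candidate(x, a, b, d, e):
    # The first six give the maximum-likelihood estimates of the three
    # conditional probabilities; the last three are the kind of
    # lower-order counts a back-off scheme typically adds on top.
    return [
        (a, b, x), (a, b),   # P(x | a b) = c(a b x) / c(a b)
        (b, x, d), (b, x),   # P(d | b x) = c(b x d) / c(b x)
        (x, d, e), (x, d),   # P(e | x d) = c(x d e) / c(x d)
        (x,), (d,), (e,),    # unigram back-off counts
    ]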

In the end it turns out I only need the top 10 or 100 most likely
substitutes, and I have gone through a whole lot of trouble to get
them.  If some "oracle" whispered 50 words in my ear that were very
likely to contain the top 10, that would save me a lot of work.  Or
maybe there is a whole other approach I haven't thought of...  What
is your suggestion?
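
To make the question concrete, here is the kind of two-stage shortcut I
imagine: shortlist candidates using only the cheap forward term P(X|AB),
then rescore just the shortlist with the full three-term product
(reusing score_candidate from the sketch above).  Whether such a
shortlist reliably contains the true top 10 is exactly the part I don't
know:

def two_stage_rank(candidates, a, b, d, e, logprob, shortlist=50, k=10):
    # Stage 1: rank by P(x | a b) alone -- one lookup per candidate.
    cheap = sorted(candidates, key=lambda x: logprob(x, (a, b)),
                   reverse=True)
    # Stage 2: full left+right scoring on the shortlist only.
    rescored = sorted(((score_candidate(x, a, b, d, e, logprob), x)
                       for x in cheap[:shortlist]), reverse=True)
    return rescored[:k]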

thanks,
deniz
