[SRILM User List] finding likely substitutes quickly
Deniz Yuret
dyuret at ku.edu.tr
Fri Apr 23 05:28:23 PDT 2010
Hi,
Recently I have been working with several disambiguation systems which
all rely on finding likely substitutes for a given target word in a
given context using a language model. I am wondering if somebody has
worked on a smarter, possibly approximate, algorithm to do this quickly.
In order to take into account the right context as well as the left,
here is what I do: Say we have a 3-gram model and the word sequence
is ABCDE with C being the target word. I take a large subset of words
from the dictionary as potential substitutes for C (sometimes
constrained to be in the same part of speech, etc., but still on the
order of thousands of candidates). Then for each candidate X, I
calculate the probability of ABXDE:
P(ABXDE) = P(A) P(B|A) P(X|AB) P(D|ABX) P(E|ABXD)
=> P(X|AB) P(D|ABX) P(E|ABXD) ;; terms without X are constant across candidates and can be dropped
=> P(X|AB) P(D|BX) P(E|XD) ;; the 3-gram assumption
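In rough Python, the brute-force loop over candidates looks something
like this (logprob is a stand-in for whatever smoothed 3-gram lookup
is actually used, e.g. SRILM or counts-based estimates; the name and
signature are made up for illustration):

import heapq

def rank_substitutes(a, b, d, e, candidates, logprob, k=100):
    """Rank candidates X for the middle slot of A B _ D E under a
    3-gram model: log P(X|AB) + log P(D|BX) + log P(E|XD)."""
    scored = []
    for x in candidates:
        score = (logprob(x, (a, b))     # P(X|AB)
                 + logprob(d, (b, x))   # P(D|BX)
                 + logprob(e, (x, d)))  # P(E|XD)
        scored.append((score, x))
    return heapq.nlargest(k, scored)    # top-k most likely substitutes

With thousands of candidates, this loop (and the count lookups behind
each logprob call) is exactly the expensive part.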
I need the counts of at least 9 patterns for this probability, usually
more depending on the smoothing method. I extract the patterns, look
up the counts from the Google ngram data, and compute the smoothed
probabilities, etc. For good disambiguation it is important that I get
the right substitutes and the right probabilities for them.
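Just to make the counts-to-probabilities step concrete, here is what a
simple scheme such as stupid backoff would look like over a table of
raw Google ngram counts (one possible smoothing, not necessarily what
I actually use; the counts dictionary is a placeholder):

def stupid_backoff(word, context, counts, total_unigrams, alpha=0.4):
    """Score word given a context tuple using raw n-gram counts.
    counts maps n-gram tuples to frequencies; the result is an
    unnormalized score, not a true probability."""
    weight = 1.0
    while context:
        num = counts.get(context + (word,), 0)
        den = counts.get(context, 0)
        if num > 0 and den > 0:
            return weight * num / den
        context = context[1:]       # back off to a shorter history
        weight *= alpha
    return weight * counts.get((word,), 0) / total_unigrams

Even this crude scheme may touch the trigram, its bigram history, the
backed-off bigram, and the unigram counts for each of the three
conditionals, so the number of count lookups per candidate adds up
quickly.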
In the end it turns out I only need the top 10 or 100 most likely
substitutes, and I just went through a whole lot of trouble to get
them. If some "oracle" whispered into my ear 50 words that are very
likely to contain the top 10, that would save me a lot of work. Or
maybe there is a whole other approach I haven't thought of... What
is your suggestion?
thanks,
deniz