SRILM toolkit
Andreas Stolcke
stolcke at speech.sri.com
Fri Sep 21 12:53:39 PDT 2007
>
> On 20/09/2007, at 20:16, Andreas Stolcke wrote:
>
> > Raquel Justo wrote:
> >>
> >> I am working with class-based LMs and I propose using class n-gram
> >> LMs (where classes are made up of "multiword" strings, i.e.
> >> "subsequences of words") in two different ways:
> >> - In the first approach, a multiword string is treated as a new
> >> lexical unit: a single token formed by joining its words
> >> (e.g. "san_francisco", P(C_CITY_NAME)*P(san_francisco|C_CITY_NAME)).
> >> - In the second approach, the words making up the multiword string
> >> are modeled separately and their conditional probabilities are
> >> estimated. Thus a class n-gram LM is generated on the one hand, and
> >> a word n-gram LM within each class on the other
> >> (e.g. "san francisco", P(C_CITY_NAME)*P(san|C_CITY_NAME)*
> >> P(francisco|san, C_CITY_NAME)).
> > It looks to me like your second approach is equivalent to the
> > first, modulo smoothing effects achieved by the different backing
> > off distributions you might use in estimating the component
> > probabilities.
>
> I am not sure I have understood what you mean, but I think that with
> backing-off smoothing the first approach is different from the second
> one, because different combinations of all the words belonging to a
> class are allowed, whereas in the second approach only the considered
> subsequences of words are allowed, because they are treated as
> unigrams inside each class. I think that even when no smoothing is
> used the first approach can generalize better, since n-gram models
> themselves generalize over the training data.
You are right. That's actually what I meant by "different backing off".
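To make the contrast concrete, here is a rough Python sketch (not SRILM
code; the class name, tables, and probability values are made up purely
for illustration) that scores "san francisco" both ways:

# Toy probabilities, illustrative only.
p_class = 0.01                               # P(C_CITY_NAME | context)
p_token_in_class = {"san_francisco": 0.2}    # approach 1: P(token | class)
p_word_in_class = {                          # approach 2: word n-gram inside the class
    ("san",): 0.3,                           # P(san | C_CITY_NAME)
    ("francisco", "san"): 0.9,               # P(francisco | san, C_CITY_NAME)
}

p1 = p_class * p_token_in_class["san_francisco"]
p2 = p_class * p_word_in_class[("san",)] * p_word_in_class[("francisco", "san")]

print("approach 1:", p1)   # mass only on whole multiwords seen in training
print("approach 2:", p2)   # the within-class n-gram model can also assign mass
                           # to unseen word combinations through backoff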
> >>
> >> I attach a paper published at the "IEEE Workshop on Machine
> >> Learning and Signal Processing" that explains the two approaches
> >> in more detail.
> >>
> >> Do the -expand-classes or -expand-exact options do something
> >> similar to the approaches described above? Or do they convert the
> >> class n-gram LM into a word n-gram LM in which the words carry the
> >> information related to their classes (e.g. P(san#C_CITY_NAME)*
> >> P(francisco#C_CITY_NAME|san#C_CITY_NAME))?
> > Here is a high-level description of what -expand-classes does:
> >
> > 1) generate a list of all word ngrams obtained by replacing the
> > class tokens in the given LM.
> > 2) for each word ngram thus obtained:
> > a) compute the joint probability p of the entire word
> > ngram, according to the original class LM
>
> Would you mind telling me how you compute this probability when
> multiwords are involved? Do you treat the multiword as a single
> token, or do you estimate the conditional probabilities of the words
> that make up the multiword?
Are you talking about multiwords that are joined by underscores
(as handled by the -multiwords option)? In that case there is no
special processing for them in ngram -expand-classes. The class mechanism
treats multiwords as regular word tokens.
If you are asking about class expansions that contain multiple words
separated by spaces (e.g. CITY -> San Francisco), then the answer is that
the expansion algorithm I outlined above handles this case quite naturally.
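For what it's worth, here is a rough Python sketch of the expansion idea
(only an illustration of the algorithm outlined above, not the actual
ngram -expand-classes code; the class name, expansions, and scoring
function are hypothetical):

import itertools

class_expansions = {
    "C_CITY_NAME": [("san", "francisco"), ("new", "york")],
}

def expand_ngram(ngram):
    # Replace every class token by each of its word expansions
    # (an expansion may contain several space-separated words).
    choices = [class_expansions.get(tok, [(tok,)]) for tok in ngram]
    for combo in itertools.product(*choices):
        yield tuple(word for part in combo for word in part)

def joint_prob(word_ngram, score):
    # Chain-rule joint probability of the whole word n-gram under the
    # original class LM; score(word, history) stands in for that model.
    p = 1.0
    for i, word in enumerate(word_ngram):
        p *= score(word, word_ngram[:i])
    return p

toy_score = lambda word, history: 0.1   # placeholder, illustrative only
for wng in expand_ngram(("in", "C_CITY_NAME")):
    print(wng, joint_prob(wng, toy_score))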
I forgot to mention one feature of the expansion algorithm:
If the same word ngram can be generated by expanding different class ngrams,
then the corresponding joint probabilities are added, as they should be.
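As a made-up illustration of that accumulation (the word, classes, and
numbers are hypothetical, not taken from SRILM):

from collections import defaultdict

# "victoria" reachable as both a city name and a person name.
expanded = [
    (("victoria",), 0.004),   # joint probability via a C_CITY_NAME ngram
    (("victoria",), 0.001),   # joint probability via a C_PERSON_NAME ngram
]

word_ngram_prob = defaultdict(float)
for wng, p in expanded:
    word_ngram_prob[wng] += p   # same word ngram: probabilities are summed

print(word_ngram_prob[("victoria",)])   # 0.005, up to floating-point rounding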
Andreas