SRILM toolkit
Andreas Stolcke
stolcke at speech.sri.com
Fri Sep 21 12:53:39 PDT 2007
>
> On 20/09/2007, at 20:16, Andreas Stolcke wrote:
>
> > Raquel Justo wrote:
> >>
> >> I am working with class-based LMs and I propose using class n-gram
> >> LMs (where classes are made up of "multiword" strings, i.e.
> >> "subsequences of words") in two different ways:
> >> - In the first approach, a multiword string is treated as a new
> >> lexical unit: a single token formed by joining its words
> >> (e.g. "san_francisco", P(C_CITY_NAME)*P(san_francisco|C_CITY_NAME)).
> >> - In the second approach, the words making up the multiword string
> >> are modeled separately and their conditional probabilities are
> >> estimated. Thus a class n-gram LM is generated on the one hand, and
> >> a word n-gram LM within each class on the other
> >> (e.g. "san francisco", P(C_CITY_NAME)*P(san|C_CITY_NAME)*
> >> P(francisco|san, C_CITY_NAME)).
> > It looks to me like your second approach is equivalent to the
> > first, modulo smoothing effects achieved by the different backing
> > off distributions you might use in estimating the component
> > probabilities.
>
> I am not sure I have understood what you mean, but I think that with
> backing-off smoothing the first approach is different from the second
> one, because different combinations of all the words belonging to a
> class are allowed, whereas in the second approach only the considered
> subsequences of words are allowed, because they are treated as
> unigrams inside each class. I think that even when no smoothing is
> used the first approach can generalize better, since n-gram models
> themselves generalize over the training data.
You are right. That's actually what I meant by "different backing off".
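To make the contrast concrete, here is a rough Python sketch (not SRILM
code; the class name, tables, and probability values are made up purely
for illustration) that scores "san francisco" both ways:

# Toy probabilities, illustrative only.
p_class = 0.01                               # P(C_CITY_NAME | context)
p_token_in_class = {"san_francisco": 0.2}    # approach 1: P(token | class)
p_word_in_class = {                          # approach 2: word n-gram inside the class
    ("san",): 0.3,                           # P(san | C_CITY_NAME)
    ("francisco", "san"): 0.9,               # P(francisco | san, C_CITY_NAME)
}

p1 = p_class * p_token_in_class["san_francisco"]
p2 = p_class * p_word_in_class[("san",)] * p_word_in_class[("francisco", "san")]

print("approach 1:", p1)   # mass only on whole multiwords seen in training
print("approach 2:", p2)   # the within-class n-gram model can also assign mass
                           # to unseen word combinations through backoff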
> >>
> >> I attach a paper published at the "IEEE Workshop on Machine
> >> Learning and Signal Processing" that explains the two approaches
> >> in more detail.
> >>
> >> Do the -expand-classes or -expand-exact options do something
> >> similar to the approaches described above? Or do they convert the
> >> class n-gram LM into a word n-gram LM in which the words carry the
> >> information related to their classes (e.g. P(san#C_CITY_NAME)*
> >> P(francisco#C_CITY_NAME|san#C_CITY_NAME))?
> > Here is a high-level description of what -expand-classes does:
> >
> > 1) generate a list of all word ngrams obtained by replacing the
> > class tokens in the given LM.
> > 2) for each word ngram thus obtained:
> > a) compute the joint probability p of the entire word
> > ngram, according to the original class LM
>
> Would you mind telling me how you compute this probability when
> multiwords are involved? Do you treat the multiword as a single
> token, or do you estimate the conditional probabilities of the words
> that make up the multiword?
Are you talking about multiwords that are joined by underscores
(as handled by the -multiwords option)? In that case there is no
special processing for them in ngram -expand-classes. The class mechanism
treats multiwords as regular word tokens.
If you are asking about class expansions that contain multiple words
separated by spaces (e.g. CITY -> San Francisco), then the answer is that
the expansion algorithm I outlined above handles this case quite naturally.
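For what it's worth, here is a rough Python sketch of the expansion idea
(only an illustration of the algorithm outlined above, not the actual
ngram -expand-classes code; the class name, expansions, and scoring
function are hypothetical):

import itertools

class_expansions = {
    "C_CITY_NAME": [("san", "francisco"), ("new", "york")],
}

def expand_ngram(ngram):
    # Replace every class token by each of its word expansions
    # (an expansion may contain several space-separated words).
    choices = [class_expansions.get(tok, [(tok,)]) for tok in ngram]
    for combo in itertools.product(*choices):
        yield tuple(word for part in combo for word in part)

def joint_prob(word_ngram, score):
    # Chain-rule joint probability of the whole word n-gram under the
    # original class LM; score(word, history) stands in for that model.
    p = 1.0
    for i, word in enumerate(word_ngram):
        p *= score(word, word_ngram[:i])
    return p

toy_score = lambda word, history: 0.1   # placeholder, illustrative only
for wng in expand_ngram(("in", "C_CITY_NAME")):
    print(wng, joint_prob(wng, toy_score))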
I forgot to mention one feature of the expansion algorithm:
If the same word ngram can be generated by expanding different class ngrams,
then the corresponding joint probabilities are added, as they should be.
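As a made-up illustration of that accumulation (the word, classes, and
numbers are hypothetical, not taken from SRILM):

from collections import defaultdict

# "victoria" reachable as both a city name and a person name.
expanded = [
    (("victoria",), 0.004),   # joint probability via a C_CITY_NAME ngram
    (("victoria",), 0.001),   # joint probability via a C_PERSON_NAME ngram
]

word_ngram_prob = defaultdict(float)
for wng, p in expanded:
    word_ngram_prob[wng] += p   # same word ngram: probabilities are summed

print(word_ngram_prob[("victoria",)])   # 0.005, up to floating-point rounding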
Andreas