srilm toolkit
Andreas Stolcke
stolcke at speech.sri.com
Thu Sep 20 11:16:21 PDT 2007
Raquel Justo wrote:
> Dear Dr. Stolcke,
> I have seen in "SRILM - AN EXTENSIBLE LANGUAGE MODELING TOOLKIT"
> article that the srilm toolkit deals with class N-gram LMs and that it
> allows class members to be multiword strings .
> Although I have read the manual pages and seen that the "ngram"
> command has several options such as "-expand-classes k" and
> "-expand-exact k" for class expansion, I do not really understand how
> it works. Would you mind telling me where I could find further
> information on this issue?
>
> I am working with class-based LMs and I propose the use of class
> n-gram LMs (where classes are made up of "multiword" strings or
> "subsequences of words") in two different ways:
> - In a first approach, a multiword string is treated as a new
> lexical unit formed by joining its words, and it is handled as a
> single token (e.g. "san_francisco":
> P(C_CITY_NAME)*P("san_francisco"|C_CITY_NAME)).
> - In a second approach, the words making up the multiword string
> are considered separately and their conditional probabilities are
> computed. Thus, a class n-gram LM is estimated on the one hand, and a
> word n-gram LM is estimated within each class on the other
> (e.g. "san francisco":
> P(C_CITY_NAME)*P(san|C_CITY_NAME)*P(francisco|san, C_CITY_NAME)).
It looks to me like your second approach is equivalent to the first,
modulo smoothing effects from the different backoff distributions you
might use in estimating the component probabilities.
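To spell this out in your own notation: by the chain rule,

  P(san|C_CITY_NAME) * P(francisco|san, C_CITY_NAME)
      = P("san francisco"|C_CITY_NAME)

so both factorizations assign the same probability to the class member,
and any difference comes only from how the individual factors are
smoothed and backed off.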
>
> I am attaching a paper published at the "IEEE Workshop on Machine
> Learning for Signal Processing" that explains the two approaches in
> more detail.
>
> Do the -expand-classes and -expand-exact options do something similar
> to the approaches described above? Or do they convert the class
> n-gram LM into a word n-gram LM in which each word carries the
> information of its class (e.g.
> P(san#C_CITY_NAME)*P(francisco#C_CITY_NAME|san#C_CITY_NAME))?
Here is a high-level description of what -expand-classes does:
1) generate a list of all word ngrams obtained by replacing the class
tokens in the given LM with their member word strings.
2) for each word ngram thus obtained:
a) compute the joint probability p of the entire word ngram,
according to the original class LM
b) compute the joint probability q of the prefix of the ngram
(i.e., the ngram excluding its last word)
c) compute the conditional ngram probability as p/q.
3) insert the newly generated word ngrams into the original LM and
remove the class-based ngrams
4) recompute backoff weights (renormalize the model)
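To make steps 2a)-2c) concrete, here is a rough Python sketch. This is
not the SRILM code; the class, its members, and all probabilities are
invented for illustration, and backoff/smoothing is ignored.

# Toy sketch (not the SRILM implementation) of steps 2a)-2c): deriving the
# conditional probability of an expanded word ngram from joint probabilities
# computed with the original class LM.

# Hypothetical class memberships: class token -> [(expansion, P(expansion|class)), ...]
CLASSES = {
    "C_CITY_NAME": [(("san", "francisco"), 0.6), (("boston",), 0.4)],
}

# Hypothetical class-level bigram probabilities P(current token | previous token)
CLASS_LM = {
    ("<s>", "C_CITY_NAME"): 0.2,
    ("C_CITY_NAME", "weather"): 0.5,
}

def joint_prob(tokens):
    """Joint probability of a token sequence under the toy class bigram LM."""
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= CLASS_LM[(prev, cur)]
    return p

# The class ngram "<s> C_CITY_NAME weather" expands, for the member
# "san francisco", into the word ngram "<s> san francisco weather".
member_prob = dict(CLASSES["C_CITY_NAME"])[("san", "francisco")]

# 2a) p = joint probability of the entire expanded word ngram
p = joint_prob(("<s>", "C_CITY_NAME", "weather")) * member_prob

# 2b) q = joint probability of its prefix "<s> san francisco"
q = joint_prob(("<s>", "C_CITY_NAME")) * member_prob

# 2c) conditional probability of the newly created word ngram
print("P(weather | <s> san francisco) =", p / q)
# Here this is 0.5: the membership probability cancels because the prefix
# ends exactly at a class boundary; it would not cancel if the prefix ended
# inside the multiword expansion (e.g. after "san").

In practice ngram does all of this for you; something along the lines of

  ngram -lm class.lm -classes classes.defs -expand-classes 2 -write-lm word.lm

(with placeholder file names) reads a class LM and its class definitions,
expands the class ngrams as described above, and writes out the resulting
word-based LM. See the ngram(1) man page for the exact option semantics,
including when -expand-exact is preferable.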
Andreas