srilm toolkit
Andreas Stolcke
stolcke at speech.sri.com
Thu Sep 20 11:16:21 PDT 2007
Raquel Justo wrote:
> Dear Dr. Stolcke,
> I have seen in "SRILM - AN EXTENSIBLE LANGUAGE MODELING TOOLKIT"
> article that the srilm toolkit deals with class N-gram LMs and that it
> allows class members to be multiword strings .
> Although I have read the manual pages and seen that the "ngram"
> command has several options such as "-expand-classes k" and
> "-expand-exact k" for class expansion, I do not really understand how
> it works. Would you mind telling me where I could find further
> information on this issue?
>
> I am working with class-based LMs and I propose the use of class
> n-gram LMs (where classes are made up of "multiword" strings or
> "subsequences of words") in two different ways:
> - In a first approach, a multiword string is treated as a new
> lexical unit formed by joining its words, and it is handled as a
> single token (e.g. "san_francisco":
> P(C_CITY_NAME)*P("san_francisco"|C_CITY_NAME)).
> - In a second approach, the words making up the multiword string
> are considered separately and their conditional probabilities are
> computed. Thus, a class n-gram LM is estimated on the one hand, and a
> word n-gram LM is estimated within each class on the other
> (e.g. "san francisco":
> P(C_CITY_NAME)*P(san|C_CITY_NAME)*P(francisco|san, C_CITY_NAME)).
It looks to me like your second approach is equivalent to the first,
modulo smoothing effects from the different backoff distributions you
might use in estimating the component probabilities.
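To spell this out in your own notation: by the chain rule,

  P(san|C_CITY_NAME) * P(francisco|san, C_CITY_NAME)
      = P("san francisco"|C_CITY_NAME)

so both factorizations assign the same probability to the class member,
and any difference comes only from how the individual factors are
smoothed and backed off.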
>
> I am attaching a paper published at the "IEEE Workshop on Machine
> Learning for Signal Processing" that explains the two approaches in
> more detail.
>
> Do the -expand-classes and -expand-exact options do something similar
> to the approaches described above? Or do they convert the class
> n-gram LM into a word n-gram LM in which each word carries the
> information of its class (e.g.
> P(san#C_CITY_NAME)*P(francisco#C_CITY_NAME|san#C_CITY_NAME))?
Here is a high-level description of what -expand-classes does:
1) generate a list of all word ngrams obtained by replacing the class
tokens in the given LM with their member word strings.
2) for each word ngram thus obtained:
a) compute the joint probability p of the entire word ngram,
according to the original class LM
b) compute the joint probability q of the prefix of the ngram
(i.e., the ngram excluding its last word)
c) compute the conditional ngram probability as p/q.
3) insert the newly generated word ngrams into the original LM and
remove the class-based ngrams
4) recompute backoff weights (renormalize the model)
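To make steps 2a)-2c) concrete, here is a rough Python sketch. This is
not the SRILM code; the class, its members, and all probabilities are
invented for illustration, and backoff/smoothing is ignored.

# Toy sketch (not the SRILM implementation) of steps 2a)-2c): deriving the
# conditional probability of an expanded word ngram from joint probabilities
# computed with the original class LM.

# Hypothetical class memberships: class token -> [(expansion, P(expansion|class)), ...]
CLASSES = {
    "C_CITY_NAME": [(("san", "francisco"), 0.6), (("boston",), 0.4)],
}

# Hypothetical class-level bigram probabilities P(current token | previous token)
CLASS_LM = {
    ("<s>", "C_CITY_NAME"): 0.2,
    ("C_CITY_NAME", "weather"): 0.5,
}

def joint_prob(tokens):
    """Joint probability of a token sequence under the toy class bigram LM."""
    p = 1.0
    for prev, cur in zip(tokens, tokens[1:]):
        p *= CLASS_LM[(prev, cur)]
    return p

# The class ngram "<s> C_CITY_NAME weather" expands, for the member
# "san francisco", into the word ngram "<s> san francisco weather".
member_prob = dict(CLASSES["C_CITY_NAME"])[("san", "francisco")]

# 2a) p = joint probability of the entire expanded word ngram
p = joint_prob(("<s>", "C_CITY_NAME", "weather")) * member_prob

# 2b) q = joint probability of its prefix "<s> san francisco"
q = joint_prob(("<s>", "C_CITY_NAME")) * member_prob

# 2c) conditional probability of the newly created word ngram
print("P(weather | <s> san francisco) =", p / q)
# Here this is 0.5: the membership probability cancels because the prefix
# ends exactly at a class boundary; it would not cancel if the prefix ended
# inside the multiword expansion (e.g. after "san").

In practice ngram does all of this for you; something along the lines of

  ngram -lm class.lm -classes classes.defs -expand-classes 2 -write-lm word.lm

(with placeholder file names) reads a class LM and its class definitions,
expands the class ngrams as described above, and writes out the resulting
word-based LM. See the ngram(1) man page for the exact option semantics,
including when -expand-exact is preferable.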
Andreas