[SRILM User List] Constraining class building

Jonathan Mendoza mrfox321 at gmail.com
Wed May 23 10:35:12 PDT 2018


SRILM community,

I am trying to work with ngram-class.  More specifically, I want to
connect new vocabulary to semantically similar vocabulary that is already
in the corpus.  For example, my language model has the class
$organization = {ibm, intel}, and I know that {google}, which is not in
the training corpus, will show up in the same contexts in some test
corpus.  The corpus and language model I am working with are quite
simple: the language is very much like a template (or Mad Libs).
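
A minimal sketch of the kind of class extension described above, written in
Python.  It assumes class definitions are stored one expansion per line as
`class [prob] word`, which is how the SRILM classes-format layout is assumed
to look here.  The starting probabilities (0.5 each) and the 0.2 mass given
to google are made-up illustration values, and the helper names are
hypothetical.

```python
def add_word_to_class(classes, class_name, new_word, new_prob):
    """Return a copy of `classes` with `new_word` added to `class_name`.

    `classes` maps class name -> {word: membership probability}.
    Existing members are scaled by (1 - new_prob) so that the class
    probabilities still sum to 1 after the new word is added.
    """
    if not 0.0 < new_prob < 1.0:
        raise ValueError("new_prob must be strictly between 0 and 1")
    if class_name not in classes:
        raise KeyError(f"unknown class: {class_name}")
    if new_word in classes[class_name]:
        raise ValueError(f"{new_word} is already a member of {class_name}")

    # Copy so the caller's dictionary is left untouched.
    updated = {name: dict(members) for name, members in classes.items()}
    scale = 1.0 - new_prob
    updated[class_name] = {w: p * scale for w, p in updated[class_name].items()}
    updated[class_name][new_word] = new_prob
    return updated


def format_class_lines(classes):
    """Render classes as `class prob word` lines (assumed file layout)."""
    lines = []
    for name in sorted(classes):
        for word, prob in sorted(classes[name].items()):
            lines.append(f"{name} {prob:.6g} {word}")
    return lines


# Illustration values only: uniform membership, then 0.2 reserved for google.
classes = {"$organization": {"ibm": 0.5, "intel": 0.5}}
extended = add_word_to_class(classes, "$organization", "google", 0.2)
class_lines = format_class_lines(extended)
print("\n".join(class_lines))
```

How much probability to reserve for an unseen word is a modeling choice;
the value above is only a placeholder.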

Because of the structure of the corpus I am working with, I am only
concerned with a few (2-5) multi-word clusters, while retaining
single-element classes for the rest of the vocabulary.  This means that
numclasses is going to be on the order of V - O(|C|), where V is the
vocabulary size and |C| is the expected cardinality of a multi-word
cluster.  I also plan on defining the initial clusters myself, so that
ngram-class would only extend them during merging.
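
The class count implied above can be worked out directly.  The sketch below
assumes each multi-word cluster becomes exactly one class and every other
word stays in its own singleton class; the vocabulary size and cluster
sizes are hypothetical.

```python
def target_numclasses(vocab_size, cluster_sizes):
    """Class count when each cluster is one class and all other words
    remain singleton classes."""
    clustered_words = sum(cluster_sizes)
    if clustered_words > vocab_size:
        raise ValueError("clusters contain more words than the vocabulary")
    # Words outside any cluster keep one class each; each cluster adds one.
    return vocab_size - clustered_words + len(cluster_sizes)


# Hypothetical numbers: 1000 words, three clusters of 3, 4 and 2 words.
example = target_numclasses(1000, [3, 4, 2])
print(example)  # 1000 - 9 + 3 = 994
```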

Does ngram-class support a way to constrain class merging so that it only
happens between single-word classes and the predefined multi-word classes?

My initial attempt at a solution would be to iterate over a range of
numclasses values, starting from the aforementioned base classes, and see
how classes form from those initial conditions.  My worry is that words
outside the initial multi-word classes will merge with each other, leading
to a null result.
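
One way to detect the unwanted merges described above is to scan each
resulting class file for multi-word classes that contain none of the seed
words.  The sketch below assumes the same `class [prob] word` line layout;
the class names and words in the example are invented, and running the
clustering tool itself is left out.

```python
from collections import defaultdict


def is_number(token):
    """True if `token` parses as a float (used to skip the optional prob)."""
    try:
        float(token)
        return True
    except ValueError:
        return False


def parse_class_lines(lines):
    """Group member words by class name."""
    members = defaultdict(set)
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        class_name, rest = fields[0], fields[1:]
        # Drop the optional probability field when a word follows it.
        if len(rest) > 1 and is_number(rest[0]):
            rest = rest[1:]
        members[class_name].add(" ".join(rest))
    return dict(members)


def unseeded_merges(members, seed_words):
    """Return classes with more than one member and no seed word."""
    return {
        name: words
        for name, words in members.items()
        if len(words) > 1 and not (words & seed_words)
    }


# Invented clustering output for illustration.
lines = [
    "CLASS-00001 0.5 ibm",
    "CLASS-00001 0.5 intel",
    "CLASS-00002 0.5 monday",
    "CLASS-00002 0.5 tuesday",
    "CLASS-00003 1 the",
]
result = unseeded_merges(parse_class_lines(lines), {"ibm", "intel"})
print(result)
```

An empty result for a given numclasses value would mean that every merge
at that setting involved one of the predefined clusters.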

For the time being, I am going to use the -full flag to glean intuition
about word clusters, then plan my class initialization accordingly.

Best,
Jon