Building adapted language models

Andreas Stolcke stolcke at speech.sri.com
Wed Jun 10 22:36:53 PDT 2009


In message <297533.48294.qm at web110315.mail.gq1.yahoo.com> you wrote:
> 
> Hi,
> 
> Is there an option to give weights to certain training instances (sentences)? 
> For example, if I have some sentences that are more relevant to my
> translation domain and I want them to influence the LM 4 times more
> than the rest of the data.
> 
> I've done this by repeating the more relevant training instances,
> which makes the model training quite slow.  Is there an alternative
> way in SRILM?
> 
You can weight the counts, pool them, and train a single LM.  The
internal methods that perform sentence-level count generation actually
have an argument to scale the counts by a number, but this functionality
was not accessible from the command line.  So I added an ngram-count
option, -text-has-weights, which tells ngram-count that the first field
on each line is a count scaling factor (the number has to be an integer,
but can be a floating-point number if -float-counts is enabled).  This is
available in the 1.5.9-beta version that you can download now.
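As a rough illustration (this is a toy sketch, not SRILM code, and the
helper names are made up): scaling each sentence's counts by an integer
weight yields exactly the same pooled n-gram counts as repeating that
sentence the corresponding number of times, which is why the weighted
route avoids the slowdown of duplicating data.

```python
from collections import Counter

def weighted_counts(weighted_sentences, order=2):
    """Accumulate n-gram counts, scaling each sentence's counts by its weight.

    weighted_sentences: list of (weight, list_of_tokens) pairs.
    """
    counts = Counter()
    for weight, tokens in weighted_sentences:
        padded = ["<s>"] + tokens + ["</s>"]
        for i in range(len(padded) - order + 1):
            counts[tuple(padded[i:i + order])] += weight
    return counts

# Weighting a sentence by 4 gives the same counts as repeating it 4 times.
data_weighted = [(4, ["in", "domain"]), (1, ["out", "of", "domain"])]
data_repeated = [(1, ["in", "domain"])] * 4 + [(1, ["out", "of", "domain"])]
assert weighted_counts(data_weighted) == weighted_counts(data_repeated)
```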

Alternatively, you can train separate LMs for different subsets of the
data (this only makes sense if the unit of weighting is larger than a
sentence, e.g., a data source or corpus), and then interpolate (mix)
their probability estimates with weights.
LM interpolation is described with the "-mix-lm" option in ngram(1).
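For illustration only (the ngram tool performs this internally, and the
probability values below are invented): linear interpolation forms a
weighted average of the component models' probabilities for each word
given its history, with mixture weights that sum to 1.

```python
def interpolate(probs, lambdas):
    """Linear LM interpolation: p_mix(w|h) = sum_i lambda_i * p_i(w|h)."""
    assert abs(sum(lambdas) - 1.0) < 1e-9, "mixture weights must sum to 1"
    return sum(l * p for l, p in zip(lambdas, probs))

# Hypothetical probabilities of the same word given the same history
# under an in-domain LM and a general LM, mixed with weights 0.7 / 0.3.
p_in_domain, p_general = 0.02, 0.005
p_mix = interpolate([p_in_domain, p_general], [0.7, 0.3])  # ~0.0155
```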

Andreas



