MAP adaptation in SRILM
Andreas Stolcke
stolcke at speech.sri.com
Fri Dec 12 15:39:45 PST 2003
In message <20031212154054.96724.qmail at web41713.mail.yahoo.com> you wrote:
>
> Hi,
>
> I wonder if anyone has implemented MAP adaptation in SRILM yet. Is it available somewhere?
>
Jean-Francois,
There are two ways to do something like MAP adaptation.
The traditional way to do MAP adaptation estimates a new model from a
weighted mixture of background data counts and adaptation data counts.
You can do this easily by manipulating the N-gram count files and then
giving the combined counts to ngram-count to estimate a new model.
For example, say your background data is in BDATA and your adaptation
data is in ADATA; then you would do something like
ngram-count -text BDATA -write BDATA.counts
ngram-count -text ADATA -write ADATA.counts
cat BDATA.counts ADATA.counts ADATA.counts ADATA.counts | \
ngram-count -read - -lm ADAPTED-LM
(I'm omitting options controlling ngram order, smoothing, etc.).
In this case I'm weighting the adaptation data by a factor of 3, simply
by repeating the counts. In general you want a little script that takes
a count file and multiplies the counts by some constant (i.e., the
adaptation weight); see the sketch below.
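For integer weights, a one-liner along these lines should do (a sketch,
assuming the count files use SRILM's usual format of whitespace-separated
N-gram words with the count in the last field; ADATA.x3.counts is just a
name I made up for the scaled file):
gawk -v scale=3 '{ $NF = $NF * scale; print }' ADATA.counts > ADATA.x3.counts
cat BDATA.counts ADATA.x3.counts | \
ngram-count -read - -lm ADAPTED-LM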
However, this approach has some problems: by weighting the data (with a
weight other than 1) you are messing up the count-of-count statistics
that underlie most of the discounting schemes (Good-Turing and
Kneser-Ney). So you might have to use a smoothing algorithm such as
Witten-Bell that doesn't care about this, but isn't as good.
If you want to use a non-integer adaptation weight you have to use
ngram-count -float-counts, which limits your choice of smoothing
algorithms in a similar way.
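For instance, the same scaling sketch with a fractional weight might
look like this (same count-format assumption as above):
gawk -v scale=2.5 '{ $NF = $NF * scale; print }' ADATA.counts > ADATA.scaled.counts
cat BDATA.counts ADATA.scaled.counts | \
ngram-count -float-counts -read - -lm ADAPTED-LM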
The other, more commonly used LM adaptation approach is simple model
interpolation. You can achieve an effect similar to the example above
(weighting the adaptation data by a factor of three) with
ngram-count -text BDATA -lm BDATA.lm
ngram-count -text ADATA -lm ADATA.lm
ngram -lm ADATA.lm -mix-lm BDATA.lm -lambda L -write-lm ADAPTED-LM
where L is the weight given to the *model* for the adaptation data (as
opposed to the data itself). Because the two source models are
normalized first and then combined, the value of L will be less
dependent on the relative sizes of BDATA and ADATA. If you ignore
smoothing and assume MLE estimates, you can figure out a value of L that
is equivalent to the first approach for a given amount of data and
adaptation data weight (a recent paper by Michiel Bacchiani and Brian
Roark elaborates on this: Unsupervised language model adaptation,
http://www.research.att.com/~roark/ICASSP03.pdf).
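As a rough sketch of that equivalence (my notation, ignoring smoothing):
merging counts with weight w gives, for a history h,
  p_mix(w'|h) = (c_B(h,w') + w * c_A(h,w')) / (c_B(h) + w * c_A(h))
              = lambda(h) * p_A(w'|h) + (1 - lambda(h)) * p_B(w'|h)
  where lambda(h) = w * c_A(h) / (c_B(h) + w * c_A(h)),
so count merging amounts to interpolation with a context-dependent
weight; replacing the context counts by the corpus token totals N_A and
N_B gives the corpus-level approximation L ~ w * N_A / (N_B + w * N_A).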
In any case, the second approach is widely used and works quite well.
It is also quite convenient since you can combine a bunch of preexisting LMs
in various ways, without retraining any of them. Also, SRILM has a tool
for estimating the optimal interpolation weight L from held-out data
(see the ppl-scripts(1) man page).
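The usual workflow is roughly the following (a sketch; HELDOUT stands
for your held-out text file):
ngram -lm ADATA.lm -ppl HELDOUT -debug 2 > ADATA.ppl
ngram -lm BDATA.lm -ppl HELDOUT -debug 2 > BDATA.ppl
compute-best-mix ADATA.ppl BDATA.ppl
where the -debug 2 output records per-word probabilities and
compute-best-mix runs an EM search for the mixture weights.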
--Andreas