MAP adaptation in SRILM
Andreas Stolcke
stolcke at speech.sri.com
Fri Dec 12 15:39:45 PST 2003
In message <20031212154054.96724.qmail at web41713.mail.yahoo.com> you wrote:
>
> Hi,
>
> I wonder if anyone has implemented MAP adaptation in SRILM yet. Is it available somewhere?
>
Jean-Francois,
There are two ways to do something like MAP adaptation.
The traditional way to do MAP adaptation estimates a new model from a
weighted mixture of background data counts and adaptation data counts.
You can do this easily by manipulating the N-gram count files and then
giving the combined counts to ngram-count to estimate a new model.
For example, say your background data is in BDATA and your adaptation
data is in ADATA; then you would do something like
ngram-count -text BDATA -write BDATA.counts
ngram-count -text ADATA -write ADATA.counts
cat BDATA.counts ADATA.counts ADATA.counts ADATA.counts | \
ngram-count -read - -lm ADAPTED-LM
(I'm omitting options controlling ngram order, smoothing, etc.).
In this case I'm weighting the adaptation data by a factor of 3, simply
by repeating the counts. In general you want a little script that takes
a count file and multiplies the counts by some constant (i.e., the
adaptation weight); see the sketch below.
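For integer weights, a one-liner along these lines should do (a sketch,
assuming the count files use SRILM's usual format of whitespace-separated
N-gram words with the count in the last field; ADATA.x3.counts is just a
name I made up for the scaled file):
gawk -v scale=3 '{ $NF = $NF * scale; print }' ADATA.counts > ADATA.x3.counts
cat BDATA.counts ADATA.x3.counts | \
ngram-count -read - -lm ADAPTED-LM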
However, this approach has some problems: by weighting the data (with a
weight other than 1) you are messing up the count-of-count statistics
that underlie most of the discounting schemes (Good-Turing and
Kneser-Ney). So you might have to use a smoothing algorithm such as
Witten-Bell that doesn't care about this, but isn't as good.
If you want to use a non-integer adaptation weight you have to use
ngram-count -float-counts, which limits your choice of smoothing
algorithms in a similar way.
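For instance, the same scaling sketch with a fractional weight might
look like this (same count-format assumption as above):
gawk -v scale=2.5 '{ $NF = $NF * scale; print }' ADATA.counts > ADATA.scaled.counts
cat BDATA.counts ADATA.scaled.counts | \
ngram-count -float-counts -read - -lm ADAPTED-LM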
The other, more commonly used LM adaptation approach is simple model
interpolation. You can achieve an effect similar to the example above
(weighting the adaptation data by a factor of three) with
ngram-count -text BDATA -lm BDATA.lm
ngram-count -text ADATA -lm ADATA.lm
ngram -lm ADATA.lm -mix-lm BDATA.lm -lambda L -write-lm ADAPTED-LM
where L is the weight given to the *model* for the adaptation data (as
opposed to the data itself). Because the two source models are
normalized first and then combined, the value of L will be less
dependent on the relative sizes of BDATA and ADATA. If you ignore
smoothing and assume MLE estimates, you can figure out a value of L that
is equivalent to the first approach for a given amount of data and
adaptation data weight (a recent paper by Michiel Bacchiani and Brian
Roark elaborates on this: Unsupervised language model adaptation,
http://www.research.att.com/~roark/ICASSP03.pdf).
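As a rough sketch of that equivalence (my notation, ignoring smoothing):
merging counts with weight w gives, for a history h,
  p_mix(w'|h) = (c_B(h,w') + w * c_A(h,w')) / (c_B(h) + w * c_A(h))
              = lambda(h) * p_A(w'|h) + (1 - lambda(h)) * p_B(w'|h)
  where lambda(h) = w * c_A(h) / (c_B(h) + w * c_A(h)),
so count merging amounts to interpolation with a context-dependent
weight; replacing the context counts by the corpus token totals N_A and
N_B gives the corpus-level approximation L ~ w * N_A / (N_B + w * N_A).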
In any case, the second approach is widely used and works quite well.
It is also quite convenient since you can combine a bunch of preexisting LMs
in various ways, without retraining any of them. Also, SRILM has a tool
for estimating the optimal interpolation weight L from held-out data
(see the ppl-scripts(1) man page).
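The usual workflow is roughly the following (a sketch; HELDOUT stands
for your held-out text file):
ngram -lm ADATA.lm -ppl HELDOUT -debug 2 > ADATA.ppl
ngram -lm BDATA.lm -ppl HELDOUT -debug 2 > BDATA.ppl
compute-best-mix ADATA.ppl BDATA.ppl
where the -debug 2 output records per-word probabilities and
compute-best-mix runs an EM search for the mixture weights.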
--Andreas