Interpolating Language Models for Indonesian ASR
tl.martin at qut.edu.au
Tue Sep 6 17:55:09 PDT 2005
Hi All,
At present I am trying to use the SRI tools to improve the
LM for an Indonesian ASR system we are building. We have
just over ten hours of Australian Broadcasting Corporation
(ABC) training data, and at present the system gets just
over 80% on a held-out test set with a bigram LM trained on
that data. However, we also have approximately 12 million
words of text from the Indonesian newspapers Kompass and
Tempo, and we were hoping to interpolate these with the
existing ABC LM to improve the n-gram estimates and the
resulting perplexity.
Evaluating the ABC LM on a separate dev transcript gives
ppl = 297.
Following the advice given in the package notes for limited
vocabularies (our vocab is 11,000 words), I computed the
Good-Turing discount coefficients first on the unlimited
vocabulary, and then used these as input to a second pass
of ngram-count to build the LM.
I then ran

ngram -lm $sPATH_OUTPUT/lm/arpa.bo.lm -order 2 -vocab $DESIRED_VOCAB \
    -limit-vocab -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans

to get the perplexity score.
Applying the same technique to the much larger Kompass text
produces a ppl of 808 when evaluated on the ABC dev set.
All is well and good until I try to interpolate the two. I
have trialled two approaches. The first uses the dynamic
interpolation capability built into ngram. Running

ngram -bayes 0 -lm ./ABC/lm/arpa.bo.lm -mix-lm ./Kompass/lm/arpa.bo.lm \
    -debug 2 -ppl ./sri_trans/ABC.dev.sri.trans

gives a ppl of 342, i.e. much worse than the original 297.
I then tried the compute-best-mix utility, which starts off
as expected at lambda values of 0.5 and 0.5 and iterates to
0.66 and 0.33. Plugging these values into

ngram -lm ./ABC/lm/arpa.bo.lm -lambda 0.66 -mix-lm ./Kompass/lm/arpa.bo.lm \
    -debug 1 -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans | tail

yields

ppl= 331.8 ppl1= 608.504
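(For reference, the inputs to compute-best-mix were the
per-model perplexity runs with their -debug 2 output
captured to files, roughly as below; the output file names
are placeholders:

ngram -lm ./ABC/lm/arpa.bo.lm -order 2 -debug 2 \
    -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans > abc.ppl
ngram -lm ./Kompass/lm/arpa.bo.lm -order 2 -debug 2 \
    -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans > kompass.ppl
compute-best-mix abc.ppl kompass.ppl
)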
This is still worse than the unmixed ABC model. I would
expect the mixture to at worst stay the same, iterating to
lambda values that exclude the Kompass data, but the
estimated weights seem to be at odds with the ppl score. I
then trialled the same technique using Switchboard and
Gigaword and got the expected behaviour, i.e. an
improvement.
Unsure whether this was because the Kompass data was
unsuitable or I was just making a foolish error somewhere,
I trialled the CMU LM toolkit. Again using Good-Turing
discounting to build an LM and evaluating on the ABC dev
set gives a ppl of 268, which was a little surprising. More
surprising was when I used their interpolation tools. To
cut a long story short, they produce:

weights: 0.547 0.453 (7843 items) --> PP=152.624029
=============> TOTAL PP = 152.624
No doubt the devil is in the detail, but does anyone have
any suggestions?
Cheers
Terry Martin
QUT Speech Lab
Australia