Interpolating Lang Models for Indonesian ASR
Andreas Stolcke
stolcke at speech.sri.com
Wed Sep 7 09:32:09 PDT 2005
My first suggestion is to make sure that all LMs you are comparing and
interpolating use the same vocabulary. In SRILM you can enforce this
by using the -vocab option when building the LM.
PPLs over different vocabularies are of course not comparable, and it is
easy to mess this up when building LMs from different data sets.
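As a toy illustration of why the numbers are incomparable (uniform unigram "models" over made-up vocabulary sizes, not real LMs):

```python
import math

def uniform_ppl(vocab_size, num_words):
    # Total log-probability of num_words words under a uniform
    # distribution over vocab_size word types.
    logprob = -num_words * math.log(vocab_size)
    # Perplexity is exp of the negative average log-probability.
    return math.exp(-logprob / num_words)

# The same 100-word text scored with two different vocabulary sizes:
ppl_11k = uniform_ppl(11000, 100)   # ~11000
ppl_50k = uniform_ppl(50000, 100)   # ~50000
# The larger vocabulary inflates PPL even though nothing else changed,
# so the two numbers say nothing about which model is better.
```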
--Andreas
In message <20050907105509.BSN53884 at mail-msgstore01.qut.edu.au> you wrote:
> Hi All,
>
> At present I am trying to use the SRI tools to improve the
> LM for an Indonesian ASR system we are building. We have
> just over ten hours of Australian Broadcasting Commission
> training data, and at present the system gets just over 80%
> on a held-out test set with a bigram LM trained on that
> data. However, we also have approximately 12 million words
> of text from the Indonesian newspapers Kompass and Tempo,
> and were hoping to interpolate these with the existing ABC
> LM to improve the n-gram estimates and the resulting
> perplexity.
>
> Evaluating the ABC LM on a separate dev transcript gives a
> PPL of 297.
>
> Following the advice given in the package notes for limited
> vocabs (our vocab is 11,000 words), I computed the discount
> coefficients first on the unlimited vocab, and then used
> these as input to a second pass of ngram-count to build the
> LM. I used Good-Turing discounting.
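That two-pass procedure might look roughly like the following (a sketch with placeholder file names; check the ngram-count man page for the exact -gtN semantics):

```shell
# Pass 1: estimate Good-Turing discount coefficients over the FULL
# vocabulary; the parameters are saved to the -gtN files.
ngram-count -text train.txt -order 2 -gt1 gt1.params -gt2 gt2.params

# Pass 2: build the limited-vocabulary LM, reading the saved discount
# parameters back in instead of re-estimating them.
ngram-count -text train.txt -order 2 -vocab vocab.11k \
  -gt1 gt1.params -gt2 gt2.params -lm arpa.bo.lm
```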
>
>
> I then ran
>
> ngram -lm $sPATH_OUTPUT/lm/arpa.bo.lm -order 2 -vocab $DESIRED_VOCAB \
>     -limit-vocab -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans
>
> to get the perplexity score.
>
> Using the same technique on the much larger Kompass text
> produces a PPL of 808 when evaluated on the ABC dev set.
>
>
> All is well and good, until I try to interpolate the two. I
> have trialled two approaches. The first uses the dynamic
> interpolation capability built into ngram. Running
>
> ngram -bayes 0 -lm ./ABC/lm/arpa.bo.lm -mix-lm ./Kompass/lm/arpa.bo.lm \
>     -debug 2 -ppl ./sri_trans/ABC.dev.sri.trans
>
> gives a PPL of 342, i.e. much worse than the original 297.
>
> I then tried the "compute-best-mix" utility, which starts
> off as expected at lambda values 0.5 and 0.5 and iterates
> to 0.66 and 0.33. Plugging these values into
>
> ngram -lm ./ABC/lm/arpa.bo.lm -lambda 0.66 -mix-lm ./Kompass/lm/arpa.bo.lm \
>     -debug 1 -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans | tail
>
> yields
>
> ppl= 331.8 ppl1= 608.504
>
>
> Still worse. I would have expected it at worst to stay the
> same, with the lambdas iterating to values that exclude the
> Kompass data, but the estimated weights seem to be at odds
> with the PPL score. I then trialled the same technique
> using Switchboard and Gigaword data and got the expected
> behaviour, i.e. an improvement.
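For reference, the weight search compute-best-mix performs is essentially an EM loop over per-word probabilities; a toy re-implementation (my own sketch with made-up probabilities, not SRILM code) looks like this:

```python
def best_mix_lambda(p1, p2, iters=100):
    """EM re-estimation of the weight on LM1, given each test word's
    probability under LM1 (p1) and under LM2 (p2)."""
    lam = 0.5  # start from equal weights, as compute-best-mix does
    for _ in range(iters):
        # E-step: posterior probability that LM1 generated each word.
        post = [lam * a / (lam * a + (1 - lam) * b)
                for a, b in zip(p1, p2)]
        # M-step: the new weight is the average posterior.
        lam = sum(post) / len(post)
    return lam

# Made-up per-word probabilities: LM1 fits two of the three words
# better, LM2 the remaining one, so EM settles on an intermediate weight.
lam = best_mix_lambda([0.30, 0.02, 0.20], [0.10, 0.20, 0.10])
```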
>
> Unsure whether this was because the Kompass data was
> unsuitable or I was just making a foolish error somewhere,
> I trialled the CMU LM toolkit. Again using Good-Turing
> discounting to build an LM and evaluating on the ABC dev
> set gives a PPL of 268, which was a little surprising. More
> surprising was when I used their interpolation tools. To
> cut the story short, they produce:
>
>
> weights: 0.547 0.453 (7843 items) --> PP=152.624029
>
> =============> TOTAL PP = 152.624
>
> No doubt the devil is in the detail, but has anyone got any
> suggestions?
>
> Cheers
>
> Terry Martin
> QUT Speech Lab
> Australia