Interpolating Lang Models for Indonesian ASR
최준기
joonki74 at etri.re.kr
Wed Sep 7 10:01:18 PDT 2005
Hi,
I wonder whether your texts contain sentence boundary symbols such as "<s>" and "</s>".
In the CMU toolkit, you must list these symbols explicitly in the context cues file so that
the toolkit handles them correctly.
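For example (file names here are placeholders, and the -context flag is as described in the CMU-Cambridge toolkit documentation), the context cues file is just a list of the symbols, one per line:

```shell
# Context cues file for the CMU-Cambridge SLM toolkit: symbols listed here
# condition predictions but are never predicted themselves.
# File names (abc.ccs etc.) are placeholders.
printf '<s>\n</s>\n' > abc.ccs
cat abc.ccs
# The file is then passed when building the LM, e.g. (per the toolkit docs):
#   idngram2lm -idngram abc.idngram -vocab abc.vocab -context abc.ccs -arpa abc.arpa
```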
In my experience, with the same discounting scheme, the same cut-off values, and careful use
of the back-off options (for unknown words and sentence boundaries), the two LM toolkits produce
almost the same perplexities, apart from some round-off differences.
On the interpolation problem, IMHO, I suggest that you compare the N-gram probability streams on the same text
(using the fprobs files in the CMU LM toolkit and the -debug 2 option in the SRI LM toolkit).
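On the SRI side, "ngram -debug 2 -ppl" prints one "p( word | history ) = ..." line per token. A small filter (a sketch, assuming that output format; the sample input below is fabricated so the snippet runs without SRILM installed) pulls out the per-token log-probabilities so they can be pasted next to the CMU fprobs stream:

```shell
# Fake two-token sample in the format that 'ngram -debug 2 -ppl' emits; in
# practice you would use the real output of
#   ngram -lm abc.arpa -order 2 -debug 2 -ppl dev.txt
printf '%s\n' \
  'p( saya | <s> ) = [2gram] 0.05 [ -1.30103 ]' \
  'p( pergi | saya ) = [2gram] 0.1 [ -1 ]' > sri.debug2

# The last bracketed field is log10 p(word); extract one value per token.
awk -F'[][]' '/p\(/ {gsub(/ /,"",$4); print $4}' sri.debug2 > sri.logprobs
cat sri.logprobs
```

The resulting column can then be compared token by token against the CMU fprobs stream on the same text to find where the two models diverge.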
regards,
Joon Ki Choi
> -----Original Message-----
> From: owner-srilm-user at speech.sri.com
> [mailto:owner-srilm-user at speech.sri.com] On Behalf Of
> tl.martin at qut.edu.au
> Sent: Wednesday, September 07, 2005 9:55 AM
> To: srilm-user at speech.sri.com
> Subject: Interpolating Lang Models for Indonesian ASR
>
>
> Hi All,
>
> At present I am trying to use the SRI tools to improve the
> LM for an Indonesian ASR system we are building. We have
> just over ten hours of Australian Broadcasting Corporation
> (ABC) training data, and at present the system gets just over 80%
> on a held-out test set with a bigram LM trained on that
> data. However, we also have approximately 12
> million words of text from the Indonesian newspapers Kompass and
> Tempo, and we were hoping to interpolate an LM built from these with
> the existing ABC LM to improve the n-gram estimates and
> the resulting perplexity.
>
> Evaluating the ABC LM on a separate dev transcript
> gives a ppl of 297.
>
> Following the advice given in the package notes for
> limited vocabs (our vocab is 11000 words), I first computed discount
> coefficients on the unlimited vocab and then
> used these as input to a second pass of ngram-count to build
> the LM. I used Good-Turing discounting.
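The two-pass procedure above can be sketched as a dry run (the `run` helper only echoes each command, so this executes without SRILM; file names and the 11000-word vocab file are placeholders, and the -gtN read/write behaviour is as documented for SRILM's ngram-count):

```shell
# Dry-run sketch of the two-pass Good-Turing build; 'run' only echoes,
# so the script is runnable as-is. Drop the echo to execute for real.
run() { echo "$@"; }

# Pass 1: count on the full (unlimited) vocabulary; the -gtN files do not
# exist yet, so ngram-count writes the estimated discount coefficients to them.
run ngram-count -text train.txt -order 2 -gt1 gt1.params -gt2 gt2.params

# Pass 2: rebuild with the limited 11000-word vocab, reusing the saved
# coefficients instead of re-estimating them on the truncated counts.
run ngram-count -text train.txt -order 2 -vocab vocab.11k \
    -gt1 gt1.params -gt2 gt2.params -lm arpa.bo.lm
```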
>
>
> I then ran
>
> ngram -lm $sPATH_OUTPUT/lm/arpa.bo.lm -order 2 -vocab $DESIRED_VOCAB \
>     -limit-vocab -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans
>
> to get the perplexity score.
>
> Using the same technique on the much larger set of
> Kompass text produces a ppl of 808 when evaluated on
> the ABC dev set.
>
>
> All is well and good until I try to interpolate the two. I
> have trialled two approaches. The first uses the dynamic
> interpolation capability built into ngram. Using
>
> ngram -bayes 0 -lm ./ABC/lm/arpa.bo.lm -mix-lm ./Kompass/lm/arpa.bo.lm \
>     -debug 2 -ppl ./sri_trans/ABC.dev.sri.trans
>
> gives a ppl of 342, i.e. much worse than the original 297.
>
> I then tried the "compute-best-mix" utility, which
> starts off, as expected, at lambda values 0.5 and 0.5 and
> iterates to 0.66 and 0.33. Plugging these values into
>
> ngram -lm ./ABC/lm/arpa.bo.lm -lambda 0.66 -mix-lm ./Kompass/lm/arpa.bo.lm \
>     -debug 1 -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans | tail
>
> yields
>
> ppl= 331.8 ppl1= 608.504
>
>
> still worse. I would have expected it at worst to stay the same,
> with the iteration converging to lambda values that excluded the Kompass data,
> but the weights seem to be at odds with the ppl score. I then
> trialled the same technique using Switchboard and Gigaword
> and got the expected behaviour, i.e. an improvement.
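For reference, the usual compute-best-mix workflow can be sketched as a dry run (the `run` helper only echoes each command so the script executes without SRILM; paths are placeholders, flags per the SRILM man pages):

```shell
# Dry-run sketch of estimating interpolation weights with compute-best-mix;
# 'run' only echoes the commands. Drop the echo to execute for real.
run() { echo "$@"; }

# 1. Dump per-sentence/per-word statistics for each LM on the same dev set.
run "ngram -lm ABC/lm/arpa.bo.lm -order 2 -debug 2 -ppl ABC.dev.sri.trans > abc.ppl"
run "ngram -lm Kompass/lm/arpa.bo.lm -order 2 -debug 2 -ppl ABC.dev.sri.trans > kompass.ppl"

# 2. Run EM over the two -debug 2 traces to estimate the mixture weights.
run "compute-best-mix abc.ppl kompass.ppl"
```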
>
> Unsure whether this was because the Kompass data was
> unsuitable or I was just making a foolish error somewhere, I
> trialled the CMU LM toolkit. Again using Good-Turing discounting to build
> an LM and evaluating on the ABC dev set gives a ppl of 268, which was
> a little surprising. More surprising was what happened when I used their
> interpolation tools. To cut a long story short, they produce:
>
>
> weights: 0.547 0.453 (7843 items) --> PP=152.624029
>
> =============> TOTAL PP = 152.624
>
> No doubt the devil is in the detail, but does anyone have any
> suggestions?
>
> Cheers
>
> Terry Martin
> QUT Speech Lab
> Australia
>