Interpolating Lang Models for Indonesian ASR
Andreas Stolcke
stolcke at speech.sri.com
Wed Sep 7 09:32:09 PDT 2005
My first suggestion is to make sure that all LMs you are comparing and
interpolating use the same vocabulary. In SRILM you can enforce this
by using the -vocab option when building the LM.
PPLs over different vocabularies are of course not comparable, and it is
easy to mess this up when building LMs from different data sets.
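As a toy illustration of why the numbers are incomparable (uniform unigram "models" over made-up vocabulary sizes, not real LMs):

```python
import math

def uniform_ppl(vocab_size, num_words):
    # Total log-probability of num_words words under a uniform
    # distribution over vocab_size word types.
    logprob = -num_words * math.log(vocab_size)
    # Perplexity is exp of the negative average log-probability.
    return math.exp(-logprob / num_words)

# The same 100-word text scored with two different vocabulary sizes:
ppl_11k = uniform_ppl(11000, 100)   # ~11000
ppl_50k = uniform_ppl(50000, 100)   # ~50000
# The larger vocabulary inflates PPL even though nothing else changed,
# so the two numbers say nothing about which model is better.
```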
--Andreas
In message <20050907105509.BSN53884 at mail-msgstore01.qut.edu.au> you wrote:
> Hi All,
>
> At present I am trying to use the SRI tools to improve the
> LM for an Indonesian ASR system we are building. We have
> just over ten hours of Australian Broadcasting Commission
> training data, and at present the system gets just over 80%
> on a held-out test set with a bigram LM trained on that
> data. However, we also have approximately 12 million words
> of text from the Indonesian newspapers Kompass and Tempo,
> and were hoping to interpolate these with the existing ABC
> LM to improve the n-gram estimates and the resulting
> perplexity.
>
> Evaluating the ABC LM on a separate dev transcript gives a
> PPL of 297.
>
> Following the advice given in the package notes for limited
> vocabs (our vocab is 11,000 words), I computed the discount
> coefficients first on the unlimited vocab, and then used
> these as input to a second pass of ngram-count to build the
> LM. I used Good-Turing discounting.
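That two-pass procedure might look roughly like the following (a sketch with placeholder file names; check the ngram-count man page for the exact -gtN semantics):

```shell
# Pass 1: estimate Good-Turing discount coefficients over the FULL
# vocabulary; the parameters are saved to the -gtN files.
ngram-count -text train.txt -order 2 -gt1 gt1.params -gt2 gt2.params

# Pass 2: build the limited-vocabulary LM, reading the saved discount
# parameters back in instead of re-estimating them.
ngram-count -text train.txt -order 2 -vocab vocab.11k \
  -gt1 gt1.params -gt2 gt2.params -lm arpa.bo.lm
```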
>
>
> I then ran
>
> ngram -lm $sPATH_OUTPUT/lm/arpa.bo.lm -order 2 -vocab $DESIRED_VOCAB \
>     -limit-vocab -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans
>
> to get the perplexity score.
>
> Using the same technique on the much larger Kompass text
> produces a PPL of 808 when evaluated on the ABC dev set.
>
>
> All is well and good, until I try to interpolate the two. I
> have trialled two approaches. The first uses the dynamic
> interpolation capability built into ngram. Running
>
> ngram -bayes 0 -lm ./ABC/lm/arpa.bo.lm -mix-lm ./Kompass/lm/arpa.bo.lm \
>     -debug 2 -ppl ./sri_trans/ABC.dev.sri.trans
>
> gives a PPL of 342, i.e. much worse than the original 297.
>
> I then tried the "compute-best-mix" utility, which starts
> off as expected at lambda values 0.5 and 0.5 and iterates
> to 0.66 and 0.33. Plugging these values into
>
> ngram -lm ./ABC/lm/arpa.bo.lm -lambda 0.66 -mix-lm ./Kompass/lm/arpa.bo.lm \
>     -debug 1 -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans | tail
>
> yields
>
> ppl= 331.8 ppl1= 608.504
>
>
> Still worse. I would have expected it at worst to stay the
> same, with the lambdas iterating to values that exclude the
> Kompass data, but the estimated weights seem to be at odds
> with the PPL score. I then trialled the same technique
> using Switchboard and Gigaword data and got the expected
> behaviour, i.e. an improvement.
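For reference, the weight search compute-best-mix performs is essentially an EM loop over per-word probabilities; a toy re-implementation (my own sketch with made-up probabilities, not SRILM code) looks like this:

```python
def best_mix_lambda(p1, p2, iters=100):
    """EM re-estimation of the weight on LM1, given each test word's
    probability under LM1 (p1) and under LM2 (p2)."""
    lam = 0.5  # start from equal weights, as compute-best-mix does
    for _ in range(iters):
        # E-step: posterior probability that LM1 generated each word.
        post = [lam * a / (lam * a + (1 - lam) * b)
                for a, b in zip(p1, p2)]
        # M-step: the new weight is the average posterior.
        lam = sum(post) / len(post)
    return lam

# Made-up per-word probabilities: LM1 fits two of the three words
# better, LM2 the remaining one, so EM settles on an intermediate weight.
lam = best_mix_lambda([0.30, 0.02, 0.20], [0.10, 0.20, 0.10])
```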
>
> Unsure whether this was because the Kompass data was
> unsuitable or I was just making a foolish error somewhere,
> I trialled the CMU LM toolkit. Again using Good-Turing
> discounting to build an LM and evaluating on the ABC dev
> set gives a PPL of 268, which was a little surprising. More
> surprising was when I used their interpolation tools. To
> cut the story short, they produce:
>
>
> weights: 0.547 0.453 (7843 items) --> PP=152.624029
>
> =============> TOTAL PP = 152.624
>
> No doubt the devil is in the detail, but has anyone got any
> suggestions?
>
> Cheers
>
> Terry Martin
> QUT Speech Lab
> Australia