Interpolating Language Models for Indonesian ASR
tl.martin at qut.edu.au
Tue Sep 6 17:55:09 PDT 2005
Hi All,
At present I am trying to use the SRI tools to improve the
LM for an Indonesian ASR system we are building. We have
just over ten hours of Australian Broadcasting Corporation
(ABC) training data, and at present the system gets just
over 80% on a held-out test set with a bigram LM trained on
that data. However, we also have approximately 12 million
words of text from the Indonesian newspapers Kompass and
Tempo, and we were hoping to interpolate these with the
existing ABC LM to improve the n-gram estimates and the
resulting perplexity.
Evaluating the ABC LM on a separate dev transcript gives
ppl = 297.
Following the advice given in the package notes for limited
vocabularies (our vocab is 11,000 words), I computed the
Good-Turing discount coefficients first on the unlimited
vocabulary, and then used these as input to a second pass
of ngram-count to build the LM.
I then ran

ngram -lm $sPATH_OUTPUT/lm/arpa.bo.lm -order 2 -vocab $DESIRED_VOCAB \
    -limit-vocab -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans

to get the perplexity score.
Applying the same technique to the much larger Kompass text
produces a ppl of 808 when evaluated on the ABC dev set.
All is well and good until I try to interpolate the two. I
have trialled two approaches. The first uses the dynamic
interpolation capability built into ngram. Running

ngram -bayes 0 -lm ./ABC/lm/arpa.bo.lm -mix-lm ./Kompass/lm/arpa.bo.lm \
    -debug 2 -ppl ./sri_trans/ABC.dev.sri.trans

gives a ppl of 342, i.e. much worse than the original 297.
I then tried the compute-best-mix utility, which starts off
as expected at lambda values of 0.5 and 0.5 and iterates to
0.66 and 0.33. Plugging these values into

ngram -lm ./ABC/lm/arpa.bo.lm -lambda 0.66 -mix-lm ./Kompass/lm/arpa.bo.lm \
    -debug 1 -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans | tail

yields

ppl= 331.8 ppl1= 608.504
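(For reference, the inputs to compute-best-mix were the
per-model perplexity runs with their -debug 2 output
captured to files, roughly as below; the output file names
are placeholders:

ngram -lm ./ABC/lm/arpa.bo.lm -order 2 -debug 2 \
    -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans > abc.ppl
ngram -lm ./Kompass/lm/arpa.bo.lm -order 2 -debug 2 \
    -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans > kompass.ppl
compute-best-mix abc.ppl kompass.ppl
)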
This is still worse than the unmixed ABC model. I would
expect the mixture to at worst stay the same, iterating to
lambda values that exclude the Kompass data, but the
estimated weights seem to be at odds with the ppl score. I
then trialled the same technique using Switchboard and
Gigaword and got the expected behaviour, i.e. an
improvement.
Unsure whether this was because the Kompass data was
unsuitable or I was just making a foolish error somewhere,
I trialled the CMU LM toolkit. Again using Good-Turing
discounting to build an LM and evaluating on the ABC dev
set gives a ppl of 268, which was a little surprising. More
surprising was when I used their interpolation tools. To
cut a long story short, they produce:

weights: 0.547 0.453 (7843 items) --> PP=152.624029
=============> TOTAL PP = 152.624
No doubt the devil is in the detail, but does anyone have
any suggestions?
Cheers
Terry Martin
QUT Speech Lab
Australia