Interpolating Lang Models for Indonesian ASR
최준기
joonki74 at etri.re.kr
Wed Sep 7 10:01:18 PDT 2005
Hi,
I wonder whether your texts contain sentence boundary symbols such as "<s>" and "</s>".
In the CMU toolkit, you must list these symbols explicitly in the context cues file so that
the toolkit handles them correctly.
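For example (file names here are placeholders, and the -context flag is as described in the CMU-Cambridge toolkit documentation), the context cues file is just a list of the symbols, one per line:

```shell
# Context cues file for the CMU-Cambridge SLM toolkit: symbols listed here
# condition predictions but are never predicted themselves.
# File names (abc.ccs etc.) are placeholders.
printf '<s>\n</s>\n' > abc.ccs
cat abc.ccs
# The file is then passed when building the LM, e.g. (per the toolkit docs):
#   idngram2lm -idngram abc.idngram -vocab abc.vocab -context abc.ccs -arpa abc.arpa
```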
In my experience, with the same discounting scheme, the same cut-off values, and careful use
of the back-off options (for unknown words and sentence boundaries), the two LM toolkits produce
almost the same perplexities, apart from some round-off differences.
On the interpolation problem, IMHO, I suggest that you compare the N-gram probability streams on the same text
(using the fprobs files in the CMU LM toolkit and the -debug 2 option in the SRI LM toolkit).
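On the SRI side, "ngram -debug 2 -ppl" prints one "p( word | history ) = ..." line per token. A small filter (a sketch, assuming that output format; the sample input below is fabricated so the snippet runs without SRILM installed) pulls out the per-token log-probabilities so they can be pasted next to the CMU fprobs stream:

```shell
# Fake two-token sample in the format that 'ngram -debug 2 -ppl' emits; in
# practice you would use the real output of
#   ngram -lm abc.arpa -order 2 -debug 2 -ppl dev.txt
printf '%s\n' \
  'p( saya | <s> ) = [2gram] 0.05 [ -1.30103 ]' \
  'p( pergi | saya ) = [2gram] 0.1 [ -1 ]' > sri.debug2

# The last bracketed field is log10 p(word); extract one value per token.
awk -F'[][]' '/p\(/ {gsub(/ /,"",$4); print $4}' sri.debug2 > sri.logprobs
cat sri.logprobs
```

The resulting column can then be compared token by token against the CMU fprobs stream on the same text to find where the two models diverge.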
regards,
Joon Ki Choi
> -----Original Message-----
> From: owner-srilm-user at speech.sri.com
> [mailto:owner-srilm-user at speech.sri.com] On Behalf Of
> tl.martin at qut.edu.au
> Sent: Wednesday, September 07, 2005 9:55 AM
> To: srilm-user at speech.sri.com
> Subject: Interpolating Lang Models for Indonesian ASR
>
>
> Hi All,
>
> At present I am trying to use the SRI tools to improve the
> LM for an Indonesian ASR system we are building. We have
> just over ten hours of Australian Broadcasting Corporation
> (ABC) training data, and at present the system gets just over 80%
> on a held-out test set with a bigram LM trained on that
> data. However, we also have approximately 12
> million words of text from the Indonesian newspapers Kompass and
> Tempo, and we were hoping to interpolate an LM built from these with
> the existing ABC LM to improve the n-gram estimates and
> the resulting perplexity.
>
> Evaluating the ABC LM on a separate dev transcript
> gives a ppl of 297.
>
> Following the advice given in the package notes for
> limited vocabs (our vocab is 11000 words), I first computed discount
> coefficients on the unlimited vocab and then
> used these as input to a second pass of ngram-count to build
> the LM. I used Good-Turing discounting.
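The two-pass procedure above can be sketched as a dry run (the `run` helper only echoes each command, so this executes without SRILM; file names and the 11000-word vocab file are placeholders, and the -gtN read/write behaviour is as documented for SRILM's ngram-count):

```shell
# Dry-run sketch of the two-pass Good-Turing build; 'run' only echoes,
# so the script is runnable as-is. Drop the echo to execute for real.
run() { echo "$@"; }

# Pass 1: count on the full (unlimited) vocabulary; the -gtN files do not
# exist yet, so ngram-count writes the estimated discount coefficients to them.
run ngram-count -text train.txt -order 2 -gt1 gt1.params -gt2 gt2.params

# Pass 2: rebuild with the limited 11000-word vocab, reusing the saved
# coefficients instead of re-estimating them on the truncated counts.
run ngram-count -text train.txt -order 2 -vocab vocab.11k \
    -gt1 gt1.params -gt2 gt2.params -lm arpa.bo.lm
```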
>
>
> I then ran
>
> ngram -lm $sPATH_OUTPUT/lm/arpa.bo.lm -order 2 -vocab $DESIRED_VOCAB \
>     -limit-vocab -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans
>
> to get the perplexity score.
>
> Using the same technique on the much larger set of
> Kompass text produces a ppl of 808 when evaluated on
> the ABC dev set.
>
>
> All is well and good until I try to interpolate the two. I
> have trialled two approaches. The first uses the dynamic
> interpolation capability built into ngram. Using
>
> ngram -bayes 0 -lm ./ABC/lm/arpa.bo.lm -mix-lm ./Kompass/lm/arpa.bo.lm \
>     -debug 2 -ppl ./sri_trans/ABC.dev.sri.trans
>
> gives a ppl of 342, i.e. much worse than the original 297.
>
> I then tried the "compute-best-mix" utility, which
> starts off, as expected, at lambda values 0.5 and 0.5 and
> iterates to 0.66 and 0.33. Plugging these values into
>
> ngram -lm ./ABC/lm/arpa.bo.lm -lambda 0.66 -mix-lm ./Kompass/lm/arpa.bo.lm \
>     -debug 1 -ppl output/sri_trans/$PPL_CORPUS.dev.sri.trans | tail
>
> yields
>
> ppl= 331.8 ppl1= 608.504
>
>
> still worse. I would have expected it at worst to stay the same,
> with the iteration converging to lambda values that excluded the Kompass data,
> but the weights seem to be at odds with the ppl score. I then
> trialled the same technique using Switchboard and Gigaword
> and got the expected behaviour, i.e. an improvement.
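For reference, the usual compute-best-mix workflow can be sketched as a dry run (the `run` helper only echoes each command so the script executes without SRILM; paths are placeholders, flags per the SRILM man pages):

```shell
# Dry-run sketch of estimating interpolation weights with compute-best-mix;
# 'run' only echoes the commands. Drop the echo to execute for real.
run() { echo "$@"; }

# 1. Dump per-sentence/per-word statistics for each LM on the same dev set.
run "ngram -lm ABC/lm/arpa.bo.lm -order 2 -debug 2 -ppl ABC.dev.sri.trans > abc.ppl"
run "ngram -lm Kompass/lm/arpa.bo.lm -order 2 -debug 2 -ppl ABC.dev.sri.trans > kompass.ppl"

# 2. Run EM over the two -debug 2 traces to estimate the mixture weights.
run "compute-best-mix abc.ppl kompass.ppl"
```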
>
> Unsure whether this was because the Kompass data was
> unsuitable or I was just making a foolish error somewhere, I
> trialled the CMU LM toolkit. Again using Good-Turing discounting to build
> an LM and evaluating on the ABC dev set gives a ppl of 268, which was
> a little surprising. More surprising was what happened when I used their
> interpolation tools. To cut a long story short, they produce:
>
>
> weights: 0.547 0.453 (7843 items) --> PP=152.624029
>
> =============> TOTAL PP = 152.624
>
> No doubt the devil is in the detail, but does anyone have any
> suggestions?
>
> Cheers
>
> Terry Martin
> QUT Speech Lab
> Australia
>