[SRILM User List] How to compare LMs training with different vocabularies?

Meng Chen chenmengdx at gmail.com
Mon Nov 19 18:40:59 PST 2012


Yes, the number of tokens in both the training corpus and the test set is
larger when segmented with vocab2 than with vocab1, which is why the word
PPLs differed so much. I also did an experiment as follows:
I compared each sentence's log probability on the test set under LM1 and
LM2 (A = LM2, B = LM1), and separated the sentences into three sets:
A>B: sentences whose log probability under LM2 is higher than under LM1
A=B: sentences whose log probability under LM2 equals that under LM1
A<B: sentences whose log probability under LM2 is lower than under LM1
I found that the CER with LM2 is lower than with LM1 on the A>B set. It
seems sentences with higher log probabilities can have a lower CER,
assuming the acoustic model is the same under both vocabularies. However,
I also found that the CER with LM2 is higher than with LM1 on the A=B set.
So I am wondering whether the acoustic model is also influenced by the
vocabulary and segmentation.
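
A bucketing of this kind can be scripted along the following lines (a
sketch: it assumes the per-sentence "logprob=" lines printed by SRILM's
ngram -debug 1 -ppl, with the file totals on the last such line; the
output file names are placeholders):

import re

def sentence_logprobs(path):
    # Collect per-sentence logprob= values from an ngram -debug 1 -ppl run.
    probs = []
    with open(path) as f:
        for line in f:
            m = re.search(r'logprob= (\S+)', line)
            if m:
                probs.append(float(m.group(1)))
    return probs[:-1]  # the last logprob= line holds the file totals

lp1 = sentence_logprobs('ppl.lm1.out')  # B: LM1 / vocab1 segmentation
lp2 = sentence_logprobs('ppl.lm2.out')  # A: LM2 / vocab2 segmentation
assert len(lp1) == len(lp2), 'both runs must score the same sentences'

counts = {'A>B': 0, 'A=B': 0, 'A<B': 0}
for b, a in zip(lp1, lp2):
    counts['A>B' if a > b else 'A=B' if a == b else 'A<B'] += 1
print(counts)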

Thanks!

Meng CHEN


2012/11/15 Andreas Stolcke <stolcke at icsi.berkeley.edu>

>  On 11/5/2012 10:46 PM, Meng Chen wrote:
>
> Hi, I'm training LMs for a Mandarin Chinese ASR task with two different
> vocabularies, vocab1 (100,635 words) and vocab2 (102,541 words). To
> compare the performance of the two vocabularies, the training corpus is
> the same, the test corpus is the same, and the word segmentation method
> is the same, namely Forward Maximum Matching. The only difference is the
> segmentation vocabulary and the LM training vocabulary. I trained LM1 and
> LM2 with vocab1 and vocab2 respectively, and evaluated them on the test
> set. The results are as follows:
>
> LM1: logprob = -84069.7, PPL = 416.452
> LM2: logprob = -82921.7, PPL = 189.564
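>
> (For reference on how these two figures relate: if I understand SRILM's
> output correctly, ngram reports ppl = 10^(-logprob / N), where N counts
> the scored tokens, i.e. words plus sentence-end tokens, minus OOVs and
> zeroprobs. The same corpus can therefore yield very different PPLs when
> two segmentations produce different token counts.)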
>
>  It seems LM2 is much better than LM1, by both logprob and PPL. However,
> when I decode with the corresponding acoustic model, the CER (Character
> Error Rate) with LM2 is higher than with LM1, so I'm really confused.
> What is the relationship between PPL and CER? How can I compare LMs with
> different vocabularies? Can you give me some suggestions or references?
>
>  PS: There was a mistake in my last mail, so I am sending it again.
>
>
> It is hard, or even impossible, to compare two LMs with different
> vocabularies even when word segmentation is not an issue.
> But you are comparing two LMs using different segmentations (because the
> vocabularies differ), so the problem is even harder.
> The fact that your log probs differ by only a small amount (relatively)
> but the perplexities by a lot means that the segmentations (the number of
> tokens in particular) in the two systems must be quite different. Is that
> the case? Can you devise an experiment where the segmentations are kept
> as similar as possible? For example, you could apply the same segmenter
> to both test sets, and then split OOV words into their single-character
> components where needed to apply the LM.
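>
> A quick back-of-the-envelope check of this (a sketch; it assumes the
> relation ppl = 10^(-logprob / N) with no zeroprobs, where N lumps
> together scored words and sentence-end tokens):
>
> import math
>
> def implied_token_count(logprob, ppl):
>     # invert ppl = 10^(-logprob / N) to recover the token count N
>     return -logprob / math.log10(ppl)
>
> n1 = implied_token_count(-84069.7, 416.452)  # roughly 32,100 tokens
> n2 = implied_token_count(-82921.7, 189.564)  # roughly 36,400 tokens
> print(n1, n2, n2 / n1)                       # ratio is about 1.13
>
> If those assumptions hold, the vocab2 run scores about 13% more tokens
> than the vocab1 run, which would account for most of the perplexity gap.
>
> The OOV splitting could look something like this (again a sketch; vocab
> is assumed to be a set of known words, and anything else falls back to
> its individual characters):
>
> def split_oovs(tokens, vocab):
>     out = []
>     for tok in tokens:
>         if tok in vocab:
>             out.append(tok)
>         else:
>             out.extend(tok)  # one token per character
>     return out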
>
> Anecdotally, PPL and WER are not always well correlated, though when
> comparing a large range of models the correlation is strong (if not
> perfect). See
> http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=659013 .
>
> I do not recall any systematic studies of the effect of Mandarin word
> segmentation on CER, but given the amount of work in this area over the
> last decade there must be some. Maybe someone else has some pointers?
>
> Andreas
>
>
>