[SRILM User List] How to compare LMs training with different vocabularies?

Wed Nov 14 13:27:23 PST 2012

On 11/5/2012 10:46 PM, Meng Chen wrote:
> Hi, I'm training LMs for a Mandarin Chinese ASR task with two different
> vocabularies, vocab1 (100635 words) and vocab2 (102541 words). In order
> to compare the performance of the two vocabularies, the training corpus
> is the same, the test corpus is the same, and the word segmentation
> method is also the same, namely Forward Maximum Match. The only
> difference is the vocabulary used for segmentation and for LM training.
> I trained LM1 and LM2 with vocab1 and vocab2, and evaluated them on the
> test set. The results are as follows:
>
> LM1: logprobs = -84069.7, PPL = 416.452.
> LM2: logprobs = -82921.7, PPL = 189.564.
>
> It seems LM2 is much better than LM1, either by logprobs or by PPL.
> However, when I decode with the corresponding acoustic model, the CER
> (Character Error Rate) of LM2 is higher than that of LM1. What is the
> relationship between PPL and CER? How should LMs with different
> vocabularies be compared? Can you give me some suggestions or
> references? I'm really confused.
>
> ps: There was a mistake in my last mail, so I am sending it again.

It is hard, if not impossible, to compare two LMs with different
vocabularies even when word segmentation is not an issue.
But you are comparing two LMs that use different segmentations (because
the vocabularies differ), so the problem is even harder.
The fact that your log probs differ by only a small amount (relatively)
but the perplexities differ by a lot means that your segmentation (the
number of tokens in particular) in the two systems must be quite
different.  Is that the case?  Can you devise an experiment where the
segmentations are kept as similar as possible?  For example, you could
apply the same segmenter to both test cases, and then split OOV words
into their single-character components where needed to apply the LM.
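
To make the token-count point concrete, here is a rough sketch in Python.
It assumes the figures come from "ngram -ppl", which reports
ppl = 10^(-logprob / (words - OOVs + sentences)), so the denominator can
be recovered from the two numbers quoted above. The second function
illustrates the OOV-splitting idea; the function names and the toy
vocabulary are made up for illustration and are not part of SRILM.

```python
import math


def implied_token_count(logprob, ppl):
    """Recover the perplexity denominator from a base-10 logprob and a PPL.

    Assumes ppl = 10 ** (-logprob / N), where N is the number of scored
    tokens (for ngram -ppl: words - OOVs + sentences).
    """
    return -logprob / math.log10(ppl)


def split_oov(tokens, vocab):
    """Replace each out-of-vocabulary token by its single characters.

    In-vocabulary tokens are kept as they are, so two test sets segmented
    the same way differ only where a word is missing from one vocabulary.
    """
    out = []
    for tok in tokens:
        if tok in vocab:
            out.append(tok)
        else:
            out.extend(tok)  # iterating a str yields its characters
    return out


if __name__ == "__main__":
    # Figures quoted in the original question.
    n1 = implied_token_count(-84069.7, 416.452)
    n2 = implied_token_count(-82921.7, 189.564)
    print(f"LM1 scored tokens: {n1:.0f}")
    print(f"LM2 scored tokens: {n2:.0f}")

    # Toy example (hypothetical vocabulary, not from the original post).
    toy_vocab = {"北京", "大学"}
    print(split_oov(["北京", "大学生"], toy_vocab))
```

With the numbers above this gives roughly 32,100 scored tokens for LM1
and 36,400 for LM2, so the two perplexities are averaged over quite
different token counts and are not directly comparable.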

Anecdotally, PPL and WER are not always well correlated, though when 
comparing a large range of models the correlation is strong (if not 
perfect).   See 
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=659013 .

I do not recall any systematic studies of the effect of Mandarin word
segmentation on CER, but given the amount of work in this area in the
last decade there must be some.  Maybe someone else has some pointers?

Andreas
