[SRILM User List] Re: A confusion of the interpolated language model
海龙 史
shl.thcn at yahoo.com.cn
Fri Aug 28 22:21:25 PDT 2009
Hi, thanks for your reply!
I do know that a back-off weight is not a probability, but with the interpolated modified Kneser-Ney smoothing method, bows are not supposed to be greater than 1.
In the SRILM man page ngram-discount(7), I found the following:
For back-off smoothing, there is
(1) p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)
where f(a_z) depends on the smoothing method, and bow(a_) is determined by the normalization constraint below:
Sum_Z p(a_z) = 1
Sum_Z1 f(a_z) + Sum_Z0 bow(a_) p(_z) = 1
(2) bow(a_) = (1 - Sum_Z1 f(a_z)) / Sum_Z0 p(_z)
            = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 p(_z))
            = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 f(_z))
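The normalization behind (2) is easy to check numerically. Here is a minimal Python sketch; all the probability values in it are invented purely for illustration, not taken from any real model:

```python
# Toy check of the back-off normalization in (1)-(2).
# Z1: words z with c(a_z) > 0 (seen); Z0: words with c(a_z) = 0.
# All values below are invented for illustration.

f = {"x": 0.4, "y": 0.3}                            # f(a_z) for seen words (Z1)
p_low = {"x": 0.2, "y": 0.1, "u": 0.3, "v": 0.4}    # lower-order p(_z); sums to 1

# (2): bow(a_) = (1 - Sum_Z1 f(a_z)) / (1 - Sum_Z1 p(_z))
bow = (1 - sum(f.values())) / (1 - sum(p_low[z] for z in f))

# (1): seen words keep f(a_z); unseen words back off to bow(a_) p(_z)
p = {z: f.get(z, bow * p_low[z]) for z in p_low}
print(sum(p.values()))   # normalizes to 1 by construction
```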
but for interpolated smoothing, there is
(3) f(a_z) = g(a_z) + bow(a_) p(_z)
(4) p(a_z) = (c(a_z) > 0) ? f(a_z) : bow(a_) p(_z)
and
Sum_Z p(a_z) = 1
Sum_Z1 g(a_z) + Sum_Z bow(a_) p(_z) = 1
(5) bow(a_) = 1 - Sum_Z1 g(a_z)
(where Z is the set of all words in the vocabulary, Z0 the set of all words with c(a_z) = 0, and Z1 the set of all words with c(a_z) > 0)
However, in the SRILM source code it seems that the interpolated bows are first calculated using (5), the probs and bows are then converted into the back-off representation using (3), and finally the back-off version of the bows is recomputed using (2). I just don't understand why SRILM does not use the bow calculated with (5) directly.
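For what it's worth, in exact arithmetic the two routes agree: substituting (3) into (2) and using bow(a_) = 1 - Sum_Z1 g(a_z) reproduces the value from (5). The Python sketch below illustrates this with invented numbers; it is only a numeric check of the equations above, not a claim about SRILM's actual internals:

```python
# Sketch of the interpolated -> back-off conversion, eqs (5), (3), (2).
# All numbers are invented; Z1 = {"x", "y"} are the seen words.

g = {"x": 0.35, "y": 0.25}                          # discounted mass g(a_z)
p_low = {"x": 0.2, "y": 0.1, "u": 0.3, "v": 0.4}    # lower-order p(_z); sums to 1

bow5 = 1 - sum(g.values())                          # (5): interpolated bow
f = {z: g[z] + bow5 * p_low[z] for z in g}          # (3): fold into back-off probs

# (2): recompute the bow from the converted f values
bow2 = (1 - sum(f.values())) / (1 - sum(p_low[z] for z in f))
print(bow5, bow2)   # equal up to floating-point rounding
```

If this identity holds, one plausible reason for recomputing via (2) is that the back-off representation must stay self-normalized after any later step (pruning, for example) that changes the set Z1, so the recomputation is the general-purpose path.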
Besides, I have used the entropy-pruning method to construct a language model:
~ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 0 -kndiscount -interpolate -prune 0.000000001 -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_pruned1e-9.lm
and there is definitely no bow greater than 1.
So this problem is weird, and I wonder if any of you know about it. Also, was the command I used to build the modified Kneser-Ney discounted language model (where I want to exclude the 3-grams with a count of 1) correct?
~ ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 2 -kndiscount -interpolate -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_all_pruned.lm
Thank you very much!
史海龙
Hailoon Shi
W63, EE Dept., Tsinghua Univ., Beijing, China
________________________________
From: Yannick Estève <yannick.esteve at lium.univ-lemans.fr>
To: 海龙 史 <shl.thcn at yahoo.com.cn>
Cc: srilm-user at speech.sri.com
Sent: Thursday, Aug 27, 2009, 4:19:44 PM
Subject: Re: [SRILM User List] A confusion of the interpolated language model
Hi,
Back-off weights are not probabilities: they can be greater than 1.
So, your values are normal. You can find an explanation of back-off weight computation here, in particular for the modified Kneser-Ney discounting method:
http://www.speech.sri.com/projects/srilm/manpages/pdfs/chen-goodman-tr-10-98.pdf
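To make the point concrete: a back-off model can be perfectly normalized and still carry bow(a_) > 1 whenever the seen words' discounted mass Sum_Z1 f(a_z) is smaller than the lower-order mass Sum_Z1 p(_z) that they cover. A tiny Python example with invented numbers:

```python
# Invented example of a normalized back-off distribution with bow > 1.
f = {"x": 0.1}                    # single seen word, heavily discounted
p_low = {"x": 0.5, "u": 0.5}      # lower-order distribution; sums to 1

# bow = (1 - Sum_Z1 f) / (1 - Sum_Z1 p(_z)) = 0.9 / 0.5
bow = (1 - sum(f.values())) / (1 - sum(p_low[z] for z in f))

p = {z: f.get(z, bow * p_low[z]) for z in p_low}
print(bow, sum(p.values()))       # bow exceeds 1, yet the model sums to 1
```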
Regards,
Yannick Estève
LIUM - University of Le Mans
France
On Aug 27, 2009, at 09:21, 海龙 史 wrote:
>I am a new student user of SRILM from Asia. I used the command below to construct an interpolated modified Kneser-Ney discounted language model:
>~ ngram-count -read merge_counts_1994-2003.gz -gt1min 0 -gt2min 0 -gt3min 2 -kndiscount -interpolate -order 3 -vocab ChWord.lexno -lm 1994-2003_lm_all_pruned.lm
>
> However, in my model several N-grams' back-off weights (bows) appear to be greater than 1. That is, in the text LM file I have the line:
>-6.457229 <s> 1635 0.1270406
>(here we use an index to represent a Chinese word)
>in which the log10(bow) is greater than 0. We don't think a normal interpolated discount method can produce an N-gram bow greater than 1; besides, this only occurred for a few (fewer than 5) different
> N-grams. So I am confused and would like to ask whether anyone has encountered this or happens to know what is wrong.
>Thank you very much!
>
>史海龙
>Hailoon Shi
>W63, EE Dept., Tsinghua Univ., PRC
>
>_______________________________________________
>SRILM-User site list
>SRILM-User at speech.sri.com
>http://www.speech.sri.com/mailman/listinfo/srilm-user