[SRILM User List] Please help me understand the debug info of the -interpolate -kndiscount

贺天行 cloudygooseg at gmail.com
Wed May 29 07:23:09 PDT 2013


I'm terribly sorry: when I did the calculation following the manual, I
mixed up the Ds, so I couldn't get the output right.
I can now reproduce g() for the unigram by following the manual.
So my question becomes simpler. For computing bow() for the unigram, the
manual gives two formulas:
Let Z1 be the set {z: c(a_z) > 0}. For highest order N-grams we
have:

	g(a_z)  = max(0, c(a_z) - D) / c(a_)
	bow(a_) = 1 - Sum_Z1 g(a_z)
	        = 1 - Sum_Z1 c(a_z) / c(a_) + Sum_Z1 D / c(a_)
	        = D n(a_*) / c(a_)


Let Z2 be the set {z: n(*_z) > 0}. For lower order N-grams we have:

	g(_z)  = max(0, n(*_z) - D) / n(*_*)
	bow(_) = 1 - Sum_Z2 g(_z)
	       = 1 - Sum_Z2 n(*_z) / n(*_*) + Sum_Z2 D / n(*_*)
	       = D n(_*) / n(*_*)

I don't know which equation to use when computing bow() for the
unigram. Also, in the unigram case, what do 'a' and '_' mean?
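
As a sanity check on the two closed forms quoted from the manual, the sketch below confirms numerically that 1 - Sum g() collapses to D * (number of seen types) / (total count). The toy counts, the value of D, and the function names are all hypothetical; the identity relies on every nonzero count being at least D, so that max(0, ...) never clips.

```python
def bow_from_sum(counts, D):
    """bow = 1 - Sum_z g(z), with g(z) = max(0, count - D) / total."""
    total = sum(counts.values())
    return 1.0 - sum(max(0.0, c - D) / total
                     for c in counts.values() if c > 0)

def bow_closed_form(counts, D):
    """bow = D * (number of types with count > 0) / total."""
    total = sum(counts.values())
    n_types = sum(1 for c in counts.values() if c > 0)
    return D * n_types / total

# Hypothetical counts for the words seen after one context.
toy = {"x": 3, "y": 1, "z": 2}
D = 0.75

print(bow_from_sum(toy, D), bow_closed_form(toy, D))
```

The same code covers both cases in the manual: pass c(a_z) counts for the highest order, or n(*_z) type counts for the lower orders.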

Also, I still don't understand the -debug 5 output from my last mail.

Terribly sorry again for my mistake. I hope I didn't waste your time, and many thanks.


Goose



2013/5/29 贺天行 <cloudygooseg at gmail.com>

> Hello, I'm trying to understand how SRILM produces the output in the
> lm file, but I cannot figure out where these numbers come from.
>
> ngram-count -order 2 -gt1min 1 -gt2min 1 -gt3min 1 -text test_htx.dat
> -write1 cnt1 -write2 cnt2 -write3 cnt3 -kndiscount1 -kndiscount2
> -kndiscount3 -debug 5 -lm lmtest2
> test_htx.dat: line 22: 22 sentences, 67 words, 0 OOVs
> 0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
> using ModKneserNey for 1-grams
> modifying 1-gram counts for Kneser-Ney smoothing
> Kneser-Ney smoothing 1-grams
> n1 = 2
> n2 = 4
> n3 = 4
> n4 = 4
> D1 = 0.2
> D2 = 1.4
> D3+ = 2.2
> using ModKneserNey for 2-grams
> Kneser-Ney smoothing 2-grams
> n1 = 34
> n2 = 10
> n3 = 3
> n4 = 3
> D1 = 0.62963
> D2 = 1.43333
> D3+ = 0.481481
> CONTEXT  WORD </s> NUMER 9 DENOM 52 DISCOUNT 0.755556 LPROB -0.883494
> CONTEXT  WORD Alice NUMER 3 DENOM 52 DISCOUNT 0.266667 LPROB -1.81291
>                                                                ........
> In the lm file:
> -99 <s> 0.1888525
> -1.309463 Alice -0.02817659
>                                                                .........
> I'm trying to understand the line
> CONTEXT  WORD Alice NUMER 3 DENOM 52 DISCOUNT 0.266667 LPROB -1.81291
> I know that NUMER 3 means
> c(* Alice)=3
> I can't figure out the other parameters or how they are calculated, or
> how the result
> -1.309463 Alice -0.02817659
> is calculated.
>
> I have referred to Chen's paper and the SRILM ngram-discount manual, but I
> still don't know what's going on.
>
> This is my cnt1 file
> <s> 22
> </s> 9
> Alice 3
> loves 4
> Bob 2
> also 3
> Kai 2
> KaiKai 3
> KK 3
> hates 2
> YY 5
> Miss 4
> MM 1
> b3 4
> a3 4
> c3 1
> d3 2
>
> Thank you very much.
>
>
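
The figures in the quoted log can be checked by hand. The sketch below is a minimal one, not SRILM's own code: it recomputes D1, D2 and D3+ from the n1..n4 values using the modified Kneser-Ney formulas from Chen and Goodman, then recomputes DISCOUNT and LPROB for the two debug lines shown. DENOM 52 equals the sum of the cnt1 entries without `<s>`, which suggests the lower-order form with an empty context. The last step, spreading the discounted mass uniformly over the 16 word types other than `<s>`, is an assumption about what -interpolate does for unigrams, made here only because it reproduces -1.309463. All function and variable names are illustrative.

```python
import math

def mod_kn_discounts(n1, n2, n3, n4):
    """Modified Kneser-Ney discounts (D1, D2, D3+) from count-of-counts."""
    y = n1 / (n1 + 2 * n2)
    return (1 - 2 * y * n2 / n1,
            2 - 3 * y * n3 / n2,
            3 - 4 * y * n4 / n3)

# Count-of-counts copied from the debug output.
uni_d = mod_kn_discounts(2, 4, 4, 4)     # log shows 0.2, 1.4, 2.2
bi_d = mod_kn_discounts(34, 10, 3, 3)    # log shows 0.62963, 1.43333, 0.481481

# Unigram counts from cnt1, leaving out <s> (assumption: never predicted).
counts = {"</s>": 9, "Alice": 3, "loves": 4, "Bob": 2, "also": 3, "Kai": 2,
          "KaiKai": 3, "KK": 3, "hates": 2, "YY": 5, "Miss": 4, "MM": 1,
          "b3": 4, "a3": 4, "c3": 1, "d3": 2}

def discount_for(c, d=uni_d):
    """Pick D1, D2 or D3+ according to the count c (0 for an unseen word)."""
    return 0.0 if c == 0 else d[min(c, 3) - 1]

denom = sum(counts.values())             # the DENOM field

def g(word):
    """Discounted estimate: (NUMER - D) / DENOM."""
    c = counts[word]
    return (c - discount_for(c)) / denom

def discount_ratio(word):
    """The DISCOUNT field: (NUMER - D) / NUMER."""
    c = counts[word]
    return (c - discount_for(c)) / c

# Probability mass removed by discounting, i.e. bow() of the empty context.
bow_empty = sum(discount_for(c) for c in counts.values()) / denom

def interpolated_logprob(word):
    """Assumed interpolation with a uniform distribution over the word types."""
    return math.log10(g(word) + bow_empty / len(counts))

for w in ("</s>", "Alice"):
    print(w, counts[w], denom, discount_ratio(w), math.log10(g(w)))
print("Alice in lm file:", interpolated_logprob("Alice"))
```

The backoff weights in the lm file (0.1888525 for `<s>`, -0.02817659 for Alice) depend on the bigram counts, which are not shown here, so they are left out of the check.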