[SRILM User List] basic LM with GT discount and Katz backoff: evaluating probabilities
Julia Olcoz Martínez
537333 at unizar.es
Wed Aug 10 01:52:59 PDT 2011
Hello,
I am trying to verify that I understand smoothing and backoff techniques
when building a Language Model (LM). To that end, I use the SRILM
toolkit as follows:
ngram-count -text train.txt -no-sos -no-eos -order 2 -lm LM_2gram_train.arpa
to create an LM trained on train.txt, without sos/eos tokens. My goal
is to calculate the resulting probabilities myself when applying GT
smoothing, and to verify my results against the ones SRILM produces.
The problem is that I do not obtain the same values.
To keep things simple, I built a tiny training corpus containing only
3 items (w1, w2, w3) plus the end-of-sentence marker ``.``:
line content
1 w1 w2 w2 w1 .
2 w1 w2 w2 w1 w1 w1 .
3 w2 w1 w1 w2 .
4 w1 .
5 w1 w1 w1 .
6 w3 w3 w3 w3 w3 w3 w3 w3 w3 w3 w3 .
7 w1 w3
8 w2 w3
9 w3 w1
10 w3 w2
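As a sanity check, the unigram and bigram counts used below can be reproduced with a short Python sketch of my own (it assumes ``.`` is purely a sentence-final marker, so no bigram crosses it):

```python
from collections import Counter

# The ten-line toy corpus from above
corpus = [
    "w1 w2 w2 w1 .",
    "w1 w2 w2 w1 w1 w1 .",
    "w2 w1 w1 w2 .",
    "w1 .",
    "w1 w1 w1 .",
    "w3 w3 w3 w3 w3 w3 w3 w3 w3 w3 w3 .",
    "w1 w3",
    "w2 w3",
    "w3 w1",
    "w3 w2",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    # Drop the "." marker; it is always sentence-final here,
    # so no spurious bigram is created by the removal.
    words = [w for w in line.split() if w != "."]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

print(dict(unigrams))          # {'w1': 14, 'w2': 8, 'w3': 15}
print(sum(bigrams.values()))   # 27
print(bigrams[("w1", "w1")])   # 5
```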
and I calculate log probabilities as follows:
x     r   nr  nr+1  r*    PGT(x)  PMLE(x)  PGT'(x)  log10[PGT'(x)]
w1w1  5   1   0     0.00  0.00    0.19     0.19     -0.73
w1w2  3   2   1     2.00  0.07    0.11     0.07     -1.13
w1w3  1   4   1     0.50  0.02    0.04     0.02     -1.73
w2w1  3   2   1     2.00  0.07    0.11     0.07     -1.13
w2w2  2   1   2     6.00  0.22    0.07     0.22     -0.65
w2w3  1   4   1     0.50  0.02    0.04     0.02     -1.73
w3w1  1   4   1     0.50  0.02    0.04     0.02     -1.73
w3w2  1   4   1     0.50  0.02    0.04     0.02     -1.73
w3w3  10  1   0     0.00  0.00    0.37     0.37     -0.43
sum   27                          1.00     1.00
where x is the bigram, r the count of x, nr the number of bigrams seen
exactly r times, nr+1 the number of bigrams seen exactly r+1 times,
r* = (r+1)*nr+1/nr, PGT(x) = r*/sum(r), PMLE(x) = r/sum(r), and
PGT'(x) = PMLE(x) if r >= k, else PGT(x), with k = 5.
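The recipe above can be sketched in a few lines of Python (a toy check using the bigram counts from the table; `r_star` and `p_gt_prime` are names I made up for the adjusted-count formula and the final estimate):

```python
from collections import Counter

# Bigram counts taken from the table above
counts = {"w1w1": 5, "w1w2": 3, "w1w3": 1, "w2w1": 3,
          "w2w2": 2, "w2w3": 1, "w3w1": 1, "w3w2": 1, "w3w3": 10}
N = sum(counts.values())      # 27
n = Counter(counts.values())  # n[r] = number of bigram types seen exactly r times

K = 5  # bigrams with r >= K keep their MLE estimate

def r_star(r):
    # Good-Turing adjusted count: r* = (r + 1) * n_{r+1} / n_r
    return (r + 1) * n[r + 1] / n[r]

def p_gt_prime(x):
    r = counts[x]
    return r / N if r >= K else r_star(r) / N

print(r_star(1), p_gt_prime("w1w3"))  # 0.5, ~0.0185
print(r_star(2), p_gt_prime("w2w2"))  # 6.0, ~0.2222
print(p_gt_prime("w1w1"))             # 5/27 ~ 0.1852 (r >= K, so plain MLE)
```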
SRILM results are:
\data\
ngram 1=5
ngram 2=9
\1-grams:
-99 </s>
-99 <s>
-0.4220737 w1 0
-0.6651117 w2 0
-0.3921105 w3 0
\2-grams:
-0.6627578 w1 w1
-0.1856366 w1 w2
-0.8846066 w1 w3
-0.1249387 w2 w1
-1 w2 w2
-0.8239087 w2 w3
-0.7269987 w3 w1
-0.7269987 w3 w2
-0.20412 w3 w3
\end\
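To compare these numbers against hand computations programmatically, the 2-gram section of the ARPA text can be pulled into a dictionary with a small sketch of my own (not using SRILM's tools):

```python
# SRILM's ARPA output, copied verbatim (raw string: backslashes are literal)
arpa = r"""\data\
ngram 1=5
ngram 2=9
\1-grams:
-99 </s>
-99 <s>
-0.4220737 w1 0
-0.6651117 w2 0
-0.3921105 w3 0
\2-grams:
-0.6627578 w1 w1
-0.1856366 w1 w2
-0.8846066 w1 w3
-0.1249387 w2 w1
-1 w2 w2
-0.8239087 w2 w3
-0.7269987 w3 w1
-0.7269987 w3 w2
-0.20412 w3 w3
\end\
"""

bigram_logprob = {}
section = None
for line in arpa.splitlines():
    line = line.strip()
    if line.startswith("\\"):  # section markers like \2-grams: or \end\
        section = line
        continue
    if section == r"\2-grams:" and line:
        fields = line.split()  # log10 probability, word1, word2
        bigram_logprob[(fields[1], fields[2])] = float(fields[0])

print(bigram_logprob[("w1", "w1")])  # -0.6627578
print(len(bigram_logprob))           # 9
```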
I see that the backoff weights are all zero because every possible
bigram is seen in the training corpus. (Later, I would like to repeat
the exercise with lines 7 to 10 removed from the corpus, and try to
calculate the Katz backoff weights myself.) I can reproduce SRILM's
probabilities for the 1-grams using MLE, as shown:
y    C(y)  PMLE(y)  log10[PMLE(y)]
w1   14    0.38     -0.4220737
w2    8    0.22     -0.6651117
w3   15    0.41     -0.3921105
sum  37    1.00
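These three values check out as plain log10 of the MLE ratios:

```python
import math

# Unigram counts from the training corpus
unigram_counts = {"w1": 14, "w2": 8, "w3": 15}
total = sum(unigram_counts.values())  # 37

for w, c in unigram_counts.items():
    print(w, round(math.log10(c / total), 7))
# w1 -0.4220737
# w2 -0.6651117
# w3 -0.3921105
```

which matches SRILM's 1-gram entries digit for digit.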
but why can I not do the same for the 2-grams?
Thank you in advance.