[SRILM User List] basic LM with GT discount and Katz backoff: evaluating probabilities
Julia Olcoz Martínez
537333 at unizar.es
Wed Aug 10 01:52:59 PDT 2011
Hello,
I am trying to verify that I understand smoothing and backoff techniques
when building a Language Model (LM). To that end, I use the SRILM
toolkit as follows:
ngram-count -text train.txt -no-sos -no-eos -order 2 -lm LM_2gram_train.arpa
to create an LM trained on train.txt, without sos/eos tokens. My goal
is to calculate the resulting probabilities myself when applying GT
smoothing, and to verify my results against the ones SRILM produces.
The problem is that I do not obtain the same values.
To keep things simple, I built a tiny training corpus containing only
3 items (w1, w2, w3) plus the end-of-sentence marker ``.``:
line content
1 w1 w2 w2 w1 .
2 w1 w2 w2 w1 w1 w1 .
3 w2 w1 w1 w2 .
4 w1 .
5 w1 w1 w1 .
6 w3 w3 w3 w3 w3 w3 w3 w3 w3 w3 w3 .
7 w1 w3
8 w2 w3
9 w3 w1
10 w3 w2
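As a sanity check, the unigram and bigram counts used below can be reproduced with a short Python sketch of my own (it assumes ``.`` is purely a sentence-final marker, so no bigram crosses it):

```python
from collections import Counter

# The ten-line toy corpus from above
corpus = [
    "w1 w2 w2 w1 .",
    "w1 w2 w2 w1 w1 w1 .",
    "w2 w1 w1 w2 .",
    "w1 .",
    "w1 w1 w1 .",
    "w3 w3 w3 w3 w3 w3 w3 w3 w3 w3 w3 .",
    "w1 w3",
    "w2 w3",
    "w3 w1",
    "w3 w2",
]

unigrams, bigrams = Counter(), Counter()
for line in corpus:
    # Drop the "." marker; it is always sentence-final here,
    # so no spurious bigram is created by the removal.
    words = [w for w in line.split() if w != "."]
    unigrams.update(words)
    bigrams.update(zip(words, words[1:]))

print(dict(unigrams))          # {'w1': 14, 'w2': 8, 'w3': 15}
print(sum(bigrams.values()))   # 27
print(bigrams[("w1", "w1")])   # 5
```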
and I calculate log probabilities as follows:
x     r   nr  nr+1  r*    PGT(x)  PMLE(x)  PGT'(x)  log10[PGT'(x)]
w1w1  5   1   0     0.00  0.00    0.19     0.19     -0.73
w1w2  3   2   1     2.00  0.07    0.11     0.07     -1.13
w1w3  1   4   1     0.50  0.02    0.04     0.02     -1.73
w2w1  3   2   1     2.00  0.07    0.11     0.07     -1.13
w2w2  2   1   2     6.00  0.22    0.07     0.22     -0.65
w2w3  1   4   1     0.50  0.02    0.04     0.02     -1.73
w3w1  1   4   1     0.50  0.02    0.04     0.02     -1.73
w3w2  1   4   1     0.50  0.02    0.04     0.02     -1.73
w3w3  10  1   0     0.00  0.00    0.37     0.37     -0.43
sum   27                          1.00     1.00
where x is the bigram, r the count of x, nr the number of bigrams seen
exactly r times, nr+1 the number of bigrams seen exactly r+1 times,
r* = (r+1)*nr+1/nr, PGT(x) = r*/sum(r), PMLE(x) = r/sum(r), and
PGT'(x) = PMLE(x) if r >= k, else PGT(x), with k = 5.
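The recipe above can be sketched in a few lines of Python (a toy check using the bigram counts from the table; `r_star` and `p_gt_prime` are names I made up for the adjusted-count formula and the final estimate):

```python
from collections import Counter

# Bigram counts taken from the table above
counts = {"w1w1": 5, "w1w2": 3, "w1w3": 1, "w2w1": 3,
          "w2w2": 2, "w2w3": 1, "w3w1": 1, "w3w2": 1, "w3w3": 10}
N = sum(counts.values())      # 27
n = Counter(counts.values())  # n[r] = number of bigram types seen exactly r times

K = 5  # bigrams with r >= K keep their MLE estimate

def r_star(r):
    # Good-Turing adjusted count: r* = (r + 1) * n_{r+1} / n_r
    return (r + 1) * n[r + 1] / n[r]

def p_gt_prime(x):
    r = counts[x]
    return r / N if r >= K else r_star(r) / N

print(r_star(1), p_gt_prime("w1w3"))  # 0.5, ~0.0185
print(r_star(2), p_gt_prime("w2w2"))  # 6.0, ~0.2222
print(p_gt_prime("w1w1"))             # 5/27 ~ 0.1852 (r >= K, so plain MLE)
```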
SRILM results are:
\data\
ngram 1=5
ngram 2=9
\1-grams:
-99 </s>
-99 <s>
-0.4220737 w1 0
-0.6651117 w2 0
-0.3921105 w3 0
\2-grams:
-0.6627578 w1 w1
-0.1856366 w1 w2
-0.8846066 w1 w3
-0.1249387 w2 w1
-1 w2 w2
-0.8239087 w2 w3
-0.7269987 w3 w1
-0.7269987 w3 w2
-0.20412 w3 w3
\end\
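To compare these numbers against hand computations programmatically, the 2-gram section of the ARPA text can be pulled into a dictionary with a small sketch of my own (not using SRILM's tools):

```python
# SRILM's ARPA output, copied verbatim (raw string: backslashes are literal)
arpa = r"""\data\
ngram 1=5
ngram 2=9
\1-grams:
-99 </s>
-99 <s>
-0.4220737 w1 0
-0.6651117 w2 0
-0.3921105 w3 0
\2-grams:
-0.6627578 w1 w1
-0.1856366 w1 w2
-0.8846066 w1 w3
-0.1249387 w2 w1
-1 w2 w2
-0.8239087 w2 w3
-0.7269987 w3 w1
-0.7269987 w3 w2
-0.20412 w3 w3
\end\
"""

bigram_logprob = {}
section = None
for line in arpa.splitlines():
    line = line.strip()
    if line.startswith("\\"):  # section markers like \2-grams: or \end\
        section = line
        continue
    if section == r"\2-grams:" and line:
        fields = line.split()  # log10 probability, word1, word2
        bigram_logprob[(fields[1], fields[2])] = float(fields[0])

print(bigram_logprob[("w1", "w1")])  # -0.6627578
print(len(bigram_logprob))           # 9
```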
I see that the backoff weights are all zero because every possible
bigram is seen in the training corpus. (Later, I would like to repeat
the exercise with lines 7 to 10 removed from the corpus, and try to
calculate the Katz backoff weights myself.) I can reproduce SRILM's
probabilities for the 1-grams using MLE, as shown:
y    C(y)  PMLE(y)  log10[PMLE(y)]
w1   14    0.38     -0.4220737
w2    8    0.22     -0.6651117
w3   15    0.41     -0.3921105
sum  37    1.00
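These three values check out as plain log10 of the MLE ratios:

```python
import math

# Unigram counts from the training corpus
unigram_counts = {"w1": 14, "w2": 8, "w3": 15}
total = sum(unigram_counts.values())  # 37

for w, c in unigram_counts.items():
    print(w, round(math.log10(c / total), 7))
# w1 -0.4220737
# w2 -0.6651117
# w3 -0.3921105
```

which matches SRILM's 1-gram entries digit for digit.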
but why can I not do the same for the 2-grams?
Thank you in advance.