[SRILM User List] Strange log probabilities
Dmytro Prylipko
dmytro.prylipko at ovgu.de
Mon Oct 1 08:34:28 PDT 2012
Hi,
I am sorry for such a long e-mail, but I have found some strange behavior in
the log probability calculation for unigrams.
I have two language models trained on two text sets. Actually, those
sets are just two different sentences, repeated 100 times each:
ACTION_REJECT_003.train.txt:
<s> der gewünschte artikel ist nicht im koffer enthalten </s> (x 100)
ACTION_REJECT_004.train.txt:
<s> ihre aussage kann nicht verarbeitet werden </s> (x 100)
Also, I have defined a few specific categories to build a class-based LM.
One class contains numbers (ein, eine, eins, einundachtzig, etc.), the second
comprises names of specific items related to the task domain
(achselshirt, blusen), and the last one consists of just two words:
'wurde' and 'wurden'.
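For reference, the classes are defined in wizard.class.defs in SRILM's
classes-format, one class expansion per line. The class names below are just
placeholders for the ones I actually use, but the file looks roughly like this:

NUMBER ein
NUMBER eine
NUMBER eins
NUMBER einundachtzig
ITEM achselshirt
ITEM blusen
AUX wurde
AUX wurden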
So I am building two expanded class-based LMs using Witten-Bell
discounting (I also tried the default Good-Turing, but with the same result):
replace-words-with-classes classes=wizard.class.defs ACTION_REJECT_003.train.txt > ACTION_REJECT_003.train.class.txt

ngram-count -text ACTION_REJECT_003.train.class.txt -lm ACTION_REJECT_003.lm -order 3 -wbdiscount1 -wbdiscount2 -wbdiscount3

ngram -lm ACTION_REJECT_003.lm -write-lm ACTION_REJECT_003.expanded.lm -order 3 -classes wizard.class.defs -expand-classes 3 -expand-exact 3 -vocab wizard.wlist
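To look at individual word scores I can also run the expanded model over a
small test file with per-word debugging output (check.txt here is just a
placeholder file containing, say, the single unseen word 'ab'):

echo "ab" > check.txt
ngram -lm ACTION_REJECT_003.expanded.lm -ppl check.txt -debug 2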
The second LM (ACTION_REJECT_004) is built using the same approach, but
the two resulting models are quite different.
ACTION_REJECT_003.expanded.lm has reasonable smoothed log probabilities
for the unseen unigrams:
\data\
ngram 1=924
ngram 2=9
ngram 3=8
\1-grams:
-0.9542425 </s>
-10.34236 <BREAK>
-99 <s> -99
-10.34236 ab
-10.34236 abgeben
[...]
-10.34236 überschritten
-10.34236 übertragung
\2-grams:
0 <s> der 0
0 artikel ist 0
0 der gewünschte 0
0 enthalten </s>
0 gewünschte artikel 0
0 im koffer 0
0 ist nicht 0
0 koffer enthalten 0
0 nicht im 0
\3-grams:
0 gewünschte artikel ist
0 <s> der gewünschte
0 koffer enthalten </s>
0 der gewünschte artikel
0 nicht im koffer
0 artikel ist nicht
0 im koffer enthalten
0 ist nicht im
\end\
Whereas in ACTION_REJECT_004.expanded.lm all unseen unigrams get a log
probability of -99, i.e. effectively zero probability:
\data\
ngram 1=924
ngram 2=7
ngram 3=6
\1-grams:
-0.845098 </s>
-99 <BREAK>
-99 <s> -99
-99 ab
-99 abgeben
[...]
-0.845098 aussage -99
[...]
-99 überschritten
-99 übertragung
\2-grams:
0 <s> ihre 0
0 aussage kann 0
0 ihre aussage 0
0 kann nicht 0
0 nicht verarbeitet 0
0 sagen </s>
0 verarbeitet sagen 0
\3-grams:
0 ihre aussage kann
0 <s> ihre aussage
0 aussage kann nicht
0 kann nicht verarbeitet
0 verarbeitet sagen </s>
0 nicht verarbeitet sagen
\end\
None of the words in either training sentence belongs to any class.
Also, I found that removing the last word from the second training
sentence fixes the problem.
Thus, for the following sentence:
<s> ihre aussage kann nicht </s>
the corresponding LM has correctly discounted probabilities (also around
-10). Replacing 'werden' with any other word (I tried 'sagen', 'abgeben'
and 'beer') causes the same problem again.
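In case it is useful, here is roughly how I reproduce the effect (the tmp.*
file names are placeholders; the class file and word list are the same as
above, and the final awk line just prints the unigram entry for the unseen
word 'ab' from the expanded model):

for tail in "verarbeitet werden" "verarbeitet sagen" ""; do
    # 100 copies of the training sentence, varying the end of the sentence
    for i in $(seq 100); do
        echo "<s> ihre aussage kann nicht $tail </s>"
    done > tmp.train.txt
    replace-words-with-classes classes=wizard.class.defs tmp.train.txt > tmp.train.class.txt
    ngram-count -text tmp.train.class.txt -lm tmp.lm -order 3 -wbdiscount1 -wbdiscount2 -wbdiscount3
    ngram -lm tmp.lm -write-lm tmp.expanded.lm -order 3 -classes wizard.class.defs \
        -expand-classes 3 -expand-exact 3 -vocab wizard.wlist
    # print the unigram line for the unseen word 'ab'
    echo "[$tail]:"
    awk '$2 == "ab"' tmp.expanded.lm
done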
Is it a bug or am I doing something wrong?
I would appreciate any advice. I can also provide all the necessary
data and scripts if needed.
Sincerely yours,
Dmytro Prylipko.