[SRILM User List] Strange log probabilities

Dmytro Prylipko dmytro.prylipko at ovgu.de
Mon Oct 1 08:34:28 PDT 2012


Hi,

I am sorry for such a long e-mail, but I have found some strange behavior 
in the unigram log probability calculation.

I have two language models trained on two text sets. Each set is 
actually just a single sentence, repeated 100 times:

ACTION_REJECT_003.train.txt:
<s> der gewünschte artikel ist nicht im koffer enthalten  </s>  (x 100)

ACTION_REJECT_004.train.txt:
<s> ihre aussage kann nicht verarbeitet werden </s> (x 100)

Also, I have defined a few specific categories to build a class-based LM.
One class is numbers (ein, eine, eins, einundachtzig, etc.), the second 
comprises names of specific items related to the task domain 
(achselshirt, blusen), and the last consists of just two words: 
'wurde' and 'wurden'.

So, I am building two expanded class-based LMs using Witten-Bell 
discounting (I also tried the default Good-Turing, with the same result):

replace-words-with-classes classes=wizard.class.defs \
    ACTION_REJECT_003.train.txt > ACTION_REJECT_003.train.class.txt

ngram-count -text ACTION_REJECT_003.train.class.txt \
    -lm ACTION_REJECT_003.lm -order 3 \
    -wbdiscount1 -wbdiscount2 -wbdiscount3

ngram -lm ACTION_REJECT_003.lm -write-lm ACTION_REJECT_003.expanded.lm \
    -order 3 -classes wizard.class.defs -expand-classes 3 -expand-exact 3 \
    -vocab wizard.wlist


The second LM (ACTION_REJECT_004) is built with exactly the same 
approach, yet the two resulting models are quite different.

ACTION_REJECT_003.expanded.lm has reasonable, smoothed log probabilities 
for the unseen unigrams:

\data\
ngram 1=924
ngram 2=9
ngram 3=8

\1-grams:
-0.9542425    </s>
-10.34236    <BREAK>
-99    <s>    -99
-10.34236    ab
-10.34236    abgeben

[...]

-10.34236    überschritten
-10.34236    übertragung

\2-grams:
0    <s> der    0
0    artikel ist    0
0    der gewünschte    0
0    enthalten </s>
0    gewünschte artikel    0
0    im koffer    0
0    ist nicht    0
0    koffer enthalten    0
0    nicht im    0

\3-grams:
0    gewünschte artikel ist
0    <s> der gewünschte
0    koffer enthalten </s>
0    der gewünschte artikel
0    nicht im koffer
0    artikel ist nicht
0    im koffer enthalten
0    ist nicht im

\end\


In ACTION_REJECT_004.expanded.lm, by contrast, all unseen unigrams have 
a log probability of -99, i.e. effectively zero probability:

\data\
ngram 1=924
ngram 2=7
ngram 3=6

\1-grams:
-0.845098    </s>
-99    <BREAK>
-99    <s>    -99
-99    ab
-99    abgeben
[...]
-0.845098    aussage    -99
[...]
-99    überschritten
-99    übertragung

\2-grams:
0    <s> ihre    0
0    aussage kann    0
0    ihre aussage    0
0    kann nicht    0
0    nicht verarbeitet    0
0    sagen </s>
0    verarbeitet sagen    0

\3-grams:
0    ihre aussage kann
0    <s> ihre aussage
0    aussage kann nicht
0    kann nicht verarbeitet
0    verarbeitet sagen </s>
0    nicht verarbeitet sagen

\end\


None of the words in either training sentence belongs to any class.

Also, I found that removing the last word from the second training 
sentence fixes the problem.
Thus, for the following sentence:

<s> ihre aussage kann nicht  </s>

the corresponding LM has correctly discounted probabilities (also around 
-10). Replacing 'werden' with any other word (I tried 'sagen', 'abgeben' 
and 'beer') brings the problem back.
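For comparison, here is a back-of-the-envelope sketch of the textbook Witten-Bell unigram estimate for the first training set. This is a hypothetical check, not SRILM's exact computation; the expanded LM's numbers also reflect class expansion and renormalization, so they will not match the -10.34236 entries above. The point is only that every unseen word should get a small but finite probability, never -99:

```python
# Hypothetical sketch: textbook Witten-Bell unigram smoothing for the
# first training sentence (not SRILM's internal computation).
import math

# 100 copies of: der gewünschte artikel ist nicht im koffer enthalten </s>
tokens = ["der", "gewünschte", "artikel", "ist", "nicht",
          "im", "koffer", "enthalten", "</s>"] * 100

n = len(tokens)                  # total token count: 900
counts = {}
for w in tokens:
    counts[w] = counts.get(w, 0) + 1
T = len(counts)                  # distinct seen types: 9

vocab_size = 924                 # from the \data\ header of the LM
Z = vocab_size - T - 1           # unseen types (assuming <s> is excluded)

def wb_logprob(word):
    """Witten-Bell unigram log10 probability."""
    if word in counts:
        return math.log10(counts[word] / (n + T))
    # leftover mass T/(n+T), shared equally among the Z unseen types
    return math.log10(T / ((n + T) * Z))

print(round(wb_logprob("</s>"), 4))   # seen word
print(round(wb_logprob("ab"), 4))     # unseen word: small but finite
```

Every unseen unigram comes out around -5 here; whatever the exact value after expansion, it should be finite, as in the first model.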

Is this a bug, or am I doing something wrong?
I would appreciate any advice. I can also provide all the necessary 
data and scripts if needed.

Sincerely yours,
Dmytro Prylipko.
