[SRILM User List] Strange log probabilities

Tue Oct 2 02:48:16 PDT 2012

Hi,

Thank you for the quick feedback.

I found something else remarkable: I ran the script on our 
cluster under CentOS (my workstation runs Kubuntu 12.04) and 
discovered that on the cluster all the LMs have zero probabilities for 
unseen 1-grams. No smoothing at all!
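
A quick way to compare the two machines is to count how many unigrams in 
each expanded LM carry the log10 value -99, which is how SRILM writes a 
zero probability in ARPA files. The following is only a sketch: the 
function name count_zero_unigrams is made up for illustration, and <s> is 
skipped because it is listed with -99 in both LMs quoted below.

```shell
#!/bin/sh
# Count unigram entries whose log10 probability is -99 in an ARPA-format LM.
# count_zero_unigrams is a hypothetical helper name, not part of SRILM.
count_zero_unigrams() {
    awk '
        /^\\1-grams:/ { in_uni = 1; next }   # start of the unigram section
        /^\\/         { in_uni = 0 }         # any other section marker ends it
        in_uni && NF >= 2 && $2 != "<s>" {   # skip blank lines and <s>
            total++
            if ($1 == -99) zero++
        }
        END { printf "%d of %d unigrams have log10 prob -99\n", zero, total }
    ' "$1"
}
```

It would be called as, for example, 
count_zero_unigrams ACTION_REJECT_004.expanded.lm on each machine.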

The setup is of course different. Output of uname -a on the cluster:

Linux frontend1.service 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21 
EST 2010 x86_64 x86_64 x86_64 GNU/Linux

On the workstation:

Linux KS-PC113 3.2.0-31-generic-pae #50-Ubuntu SMP Fri Sep 7 16:39:45 
UTC 2012 i686 i686 i386 GNU/Linux

SRILM on the cluster was built with MACHINE_TYPE=i686-m64 (with and 
without the _C option; both give the same result), on the workstation with 
MACHINE_TYPE=i686-gcc4.

The LANG variable is en_US.UTF-8 on both machines. Replacing umlauts with 
regular characters made no difference.

What exactly do you mean by 'behavior of your local awk installation 
when it encounters extended chars'?
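
One possible reading is whether awk treats a multibyte UTF-8 character 
such as ü as one character or as two bytes, which matters because SRILM 
helper scripts such as replace-words-with-classes are awk scripts. A 
minimal sketch of such a check follows; the function name word_len is 
made up, and it assumes the en_US.UTF-8 locale is installed. awk 
implementations differ here: gawk is multibyte-aware in a UTF-8 locale, 
while some others count bytes regardless of locale.

```shell
#!/bin/sh
# "über" is written as UTF-8 bytes via octal escapes, so the encoding of
# this script file does not matter: \303\274 is the two-byte form of "ü".
# word_len is a hypothetical helper name.
word_len() {
    # $1: locale to use for this single awk run
    printf '\303\274ber\n' | LC_ALL="$1" awk '{ print length($0) }'
}

echo "C locale:           $(word_len C)"
echo "en_US.UTF-8 locale: $(word_len en_US.UTF-8)"
```

If the two lines print different numbers, the local awk is 
locale-sensitive; if they agree, it counts bytes in both cases.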

I am attaching a minimal dataset that reproduces the problem. The shell 
script buildtaglm.sh does the whole job.
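
For readers without the attachment, the three steps quoted further down 
can be sketched as a loop over both corpora. This is not the attached 
buildtaglm.sh: the commands and flags are copied from the quoted message, 
while the build_lm function and the DRY_RUN switch are illustrative 
assumptions. By default the sketch only reports what it would do.

```shell
#!/bin/sh
# Sketch only, not the attached buildtaglm.sh. SRILM commands and flags are
# taken from the quoted message; build_lm and DRY_RUN are hypothetical.
CLASSES=wizard.class.defs
VOCAB=wizard.wlist
DRY_RUN=${DRY_RUN:-1}   # set DRY_RUN=0 to actually run the SRILM tools

build_lm() {
    name=$1   # corpus prefix, e.g. ACTION_REJECT_003
    if [ "$DRY_RUN" = 1 ]; then
        echo "would build $name.expanded.lm from $name.train.txt"
        return 0
    fi
    # 1. Replace words with their class labels.
    replace-words-with-classes classes="$CLASSES" "$name.train.txt" \
        > "$name.train.class.txt" || return 1
    # 2. Train a trigram LM with Witten-Bell discounting at every order.
    ngram-count -text "$name.train.class.txt" -lm "$name.lm" -order 3 \
        -wbdiscount1 -wbdiscount2 -wbdiscount3 || return 1
    # 3. Expand the class-based LM into a word-based one.
    ngram -lm "$name.lm" -write-lm "$name.expanded.lm" -order 3 \
        -classes "$CLASSES" -expand-classes 3 -expand-exact 3 \
        -vocab "$VOCAB" || return 1
}

for corpus in ACTION_REJECT_003 ACTION_REJECT_004; do
    build_lm "$corpus" || echo "build failed for $corpus" >&2
done
```

Running it with DRY_RUN=0 on both machines would produce the two 
expanded LMs to compare.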

Yours,
Dmytro Prylipko.

On Tue 02 Oct 2012 02:24:00 AM CEST, Anand Venkataraman wrote:
>
> On a first reading of your email I'm indeed surprised that the results
> differ between the two texts. Have you tried replacing the umlaut in
> the first corpus with a regular "u" and checked if you still get the
> same behavior. Check the LANG environment variable and the behavior of
> your local awk installation when it encounters extended chars.
>
> If the problem persists, please send me the two corpora, along with
> the class file and I'll be glad to take a look for you.
>
> &
>
> On Mon, Oct 1, 2012 at 8:34 AM, Dmytro Prylipko
> <dmytro.prylipko at ovgu.de> wrote:
>
> Hi,
>
> I am sorry for such a long e-mail, but I found strange behavior
> in the log probability calculation for unigrams.
>
> I have two language models trained on two text sets. Actually,
> those sets are just two different sentences, repeated 100 times each:
>
> ACTION_REJECT_003.train.txt:
> <s> der gewünschte artikel ist nicht im koffer enthalten </s> (x 100)
>
> ACTION_REJECT_004.train.txt:
> <s> ihre aussage kann nicht verarbeitet werden </s> (x 100)
>
> Also, I have defined a few specific categories to build a
> class-based LM.
> One class is numbers (ein, eine, eins, einundachtzig etc.), the
> second one comprises names of specific items related to the task
> domain (achselshirt, blusen), and the last one consists just of
> two words: 'wurde' and 'wurden'.
>
> So, I am building two expanded class-based LMs using Witten-Bell
> discounting (I also tried the default Good-Turing, but with the
> same result):
>
> replace-words-with-classes classes=wizard.class.defs
> ACTION_REJECT_003.train.txt > ACTION_REJECT_003.train.class.txt
>
> ngram-count -text ACTION_REJECT_003.train.class.txt -lm
> ACTION_REJECT_003.lm -order 3 -wbdiscount1 -wbdiscount2 -wbdiscount3
>
> ngram -lm ACTION_REJECT_003.lm -write-lm
> ACTION_REJECT_003.expanded.lm -order 3 -classes wizard.class.defs
> -expand-classes 3 -expand-exact 3 -vocab wizard.wlist
>
>
> The second LM (ACTION_REJECT_004) is built using the same
> approach. But these two models are pretty different.
>
> ACTION_REJECT_003.expanded.lm has reasonable smoothed log
> probabilities for the unseen unigrams:
>
> \data\
> ngram 1=924
> ngram 2=9
> ngram 3=8
>
> \1-grams:
> -0.9542425 </s>
> -10.34236 <BREAK>
> -99 <s> -99
> -10.34236 ab
> -10.34236 abgeben
>
> [...]
>
> -10.34236 überschritten
> -10.34236 übertragung
>
> \2-grams:
> 0 <s> der 0
> 0 artikel ist 0
> 0 der gewünschte 0
> 0 enthalten </s>
> 0 gewünschte artikel 0
> 0 im koffer 0
> 0 ist nicht 0
> 0 koffer enthalten 0
> 0 nicht im 0
>
> \3-grams:
> 0 gewünschte artikel ist
> 0 <s> der gewünschte
> 0 koffer enthalten </s>
> 0 der gewünschte artikel
> 0 nicht im koffer
> 0 artikel ist nicht
> 0 im koffer enthalten
> 0 ist nicht im
>
> \end\
>
>
> Whereas in ACTION_REJECT_004.expanded.lm all unseen unigrams have
> a zero probability:
>
> \data\
> ngram 1=924
> ngram 2=7
> ngram 3=6
>
> \1-grams:
> -0.845098 </s>
> -99 <BREAK>
> -99 <s> -99
> -99 ab
> -99 abgeben
> [...]
> -0.845098 aussage -99
> [...]
> -99 überschritten
> -99 übertragung
>
> \2-grams:
> 0 <s> ihre 0
> 0 aussage kann 0
> 0 ihre aussage 0
> 0 kann nicht 0
> 0 nicht verarbeitet 0
> 0 sagen </s>
> 0 verarbeitet sagen 0
>
> \3-grams:
> 0 ihre aussage kann
> 0 <s> ihre aussage
> 0 aussage kann nicht
> 0 kann nicht verarbeitet
> 0 verarbeitet sagen </s>
> 0 nicht verarbeitet sagen
>
> \end\
>
>
> None of the words in either training sentence belongs to any class.
>
> Also, I found that removing the last word from the second training
> sentence fixes the problem.
> Thus, for the following sentence:
>
> <s> ihre aussage kann nicht </s>
>
> the corresponding LM has correctly discounted probabilities (also
> around -10). Replacing 'werden' with any other word (I tried
> 'sagen', 'abgeben' and 'beer') causes the same problem again.
>
> Is it a bug or am I doing something wrong?
> I would appreciate any advice. I can also provide all
> necessary data and scripts if needed.
>
> Sincerely yours,
> Dmytro Prylipko.

-------------- next part --------------
A non-text attachment was scrubbed...
Name: testbed.zip
Type: application/zip
Size: 21408 bytes
Desc: not available
URL: <http://www.speech.sri.com/pipermail/srilm-user/attachments/20121002/c2cafe7d/attachment.zip>