[SRILM User List] Strange log probabilities

Anand Venkataraman venkataraman.anand at gmail.com
Tue Oct 2 13:35:07 PDT 2012


The problem is that your final vocabulary is introduced as a surprise in
the last step (when it is passed to ngram). Because the class-expansion
probabilities sum to exactly 1.0, there is no probability mass left over
for novel words in the backoff orders at that stage.

To get the correct behavior, you must prime the initial language model
with a vocabulary containing either all the class tags or all the
individual words themselves. E.g.

# Extract the class names (first field of each class definition).
awk '{print $1}' wizard.class.defs | sort -u > wizard.classnames.txt

cat $datafile \
  | replace-words-with-classes classes=wizard.class.defs - \
  | ngram-count -text - -lm - -order 1 -wbdiscount \
      -vocab wizard.classnames.txt \
  > your-lm.1bo

# Expanding classes in your-lm.1bo now will give you the desired behavior.
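For completeness, that expansion step would then look something like the
following (the output file name is just an example):

ngram -lm your-lm.1bo -order 1 \
    -classes wizard.class.defs -expand-classes 1 \
    -write-lm your-lm.expanded.1bo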

HTH

&

On Tue, Oct 2, 2012 at 2:48 AM, Dmytro Prylipko <dmytro.prylipko at ovgu.de> wrote:

>  Hi,
>
> Thank you for the quick feedback.
>
> I found something else remarkable: I tried to run the script on our
> cluster under CentOS (my workstation runs Kubuntu 12.04) and
> discovered that on the cluster all the LMs have zero probabilities for
> unseen 1-grams. No smoothing at all!
>
> The setup is of course different. Output of uname -a on the cluster:
>
> Linux frontend1.service 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21 EST
> 2010 x86_64 x86_64 x86_64 GNU/Linux
>
> On the workstation:
>
> Linux KS-PC113 3.2.0-31-generic-pae #50-Ubuntu SMP Fri Sep 7 16:39:45 UTC
> 2012 i686 i686 i386 GNU/Linux
>
> SRILM on the cluster was built with MACHINE_TYPE=i686-m64 (with and
> without the _C option; both give the same result), on the workstation
> with MACHINE_TYPE=i686-gcc4.
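>
> (A build of this kind is invoked from the SRILM top-level directory
> roughly as follows; this is a sketch, with any extra OPTION settings
> omitted from the command line:
>
> make MACHINE_TYPE=i686-m64 World
> )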
>
> The LANG variable is en_US.UTF-8 on both machines. Replacing umlauts
> with regular characters made no difference.
>
> What exactly do you mean by 'behavior of your local awk installation
> when it encounters extended chars'?
>
> So, I am sending you a minimal dataset for replicating the problem.
> The shell script buildtaglm.sh does the whole job.
>
> Yours,
> Dmytro Prylipko.
>
>
> On Tue 02 Oct 2012 02:24:00 AM CEST, Anand Venkataraman wrote:
>
>
> On a first reading of your email, I'm indeed surprised that the results
> differ between the two texts. Have you tried replacing the umlaut in
> the first corpus with a regular "u" and checking whether you still get
> the same behavior? Check the LANG environment variable and the behavior
> of your local awk installation when it encounters extended chars.
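>
> One quick check (with gawk, for example; the word is just a sample) is
> to compare string lengths under different locales:
>
> echo "gewünschte" | LANG=C awk '{print length($0)}'
> echo "gewünschte" | LANG=en_US.UTF-8 awk '{print length($0)}'
>
> If the two numbers differ, awk is counting bytes in one locale and
> characters in the other, and its handling of extended chars may differ
> accordingly.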
>
> If the problem persists, please send me the two corpora, along with
> the class file, and I'll be glad to take a look for you.
>
> &
>
> On Mon, Oct 1, 2012 at 8:34 AM, Dmytro Prylipko
> <dmytro.prylipko at ovgu.de> wrote:
>
> Hi,
>
> I am sorry for such a long e-mail, but I have found some strange
> behavior in the log probability calculation for unigrams.
>
> I have two language models trained on two text sets. Actually,
> those sets are just two different sentences, repeated 100 times each:
>
> ACTION_REJECT_003.train.txt:
> <s> der gewünschte artikel ist nicht im koffer enthalten </s> (x 100)
>
> ACTION_REJECT_004.train.txt:
> <s> ihre aussage kann nicht verarbeitet werden </s> (x 100)
>
> Also, I have defined a few specific categories to build a class-based
> LM. One class contains numbers (ein, eine, eins, einundachtzig, etc.),
> the second comprises names of specific items related to the task domain
> (achselshirt, blusen), and the last consists of just two words:
> 'wurde' and 'wurden'.
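>
> For concreteness, such a class definitions file, in SRILM's
> classes-format, might look like the following sketch (class names and
> expansion probabilities here are made up; each line gives a class name,
> an optional probability, and one expansion):
>
> NUMBER 0.25 ein
> NUMBER 0.25 eine
> NUMBER 0.25 eins
> NUMBER 0.25 einundachtzig
> ITEM 0.5 achselshirt
> ITEM 0.5 blusen
> WURDE 0.5 wurde
> WURDE 0.5 wurden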
>
> So, I am building two expanded class-based LMs using Witten-Bell
> discounting (I also tried the default Good-Turing, with the same
> result):
>
> replace-words-with-classes classes=wizard.class.defs \
>     ACTION_REJECT_003.train.txt > ACTION_REJECT_003.train.class.txt
>
> ngram-count -text ACTION_REJECT_003.train.class.txt \
>     -lm ACTION_REJECT_003.lm -order 3 \
>     -wbdiscount1 -wbdiscount2 -wbdiscount3
>
> ngram -lm ACTION_REJECT_003.lm -write-lm ACTION_REJECT_003.expanded.lm \
>     -order 3 -classes wizard.class.defs \
>     -expand-classes 3 -expand-exact 3 -vocab wizard.wlist
>
>
> The second LM (ACTION_REJECT_004) is built using the same approach,
> but the two resulting models are quite different.
>
> ACTION_REJECT_003.expanded.lm has reasonable smoothed log
> probabilities for the unseen unigrams:
>
> \data\
> ngram 1=924
> ngram 2=9
> ngram 3=8
>
> \1-grams:
> -0.9542425 </s>
> -10.34236 <BREAK>
> -99 <s> -99
> -10.34236 ab
> -10.34236 abgeben
>
> [...]
>
> -10.34236 überschritten
> -10.34236 übertragung
>
> \2-grams:
> 0 <s> der 0
> 0 artikel ist 0
> 0 der gewünschte 0
> 0 enthalten </s>
> 0 gewünschte artikel 0
> 0 im koffer 0
> 0 ist nicht 0
> 0 koffer enthalten 0
> 0 nicht im 0
>
> \3-grams:
> 0 gewünschte artikel ist
> 0 <s> der gewünschte
> 0 koffer enthalten </s>
> 0 der gewünschte artikel
> 0 nicht im koffer
> 0 artikel ist nicht
> 0 im koffer enthalten
> 0 ist nicht im
>
> \end\
>
>
> Whereas in ACTION_REJECT_004.expanded.lm all unseen unigrams have a
> log probability of -99, i.e. effectively zero probability:
>
> \data\
> ngram 1=924
> ngram 2=7
> ngram 3=6
>
> \1-grams:
> -0.845098 </s>
> -99 <BREAK>
> -99 <s> -99
> -99 ab
> -99 abgeben
> [...]
> -0.845098 aussage -99
> [...]
> -99 überschritten
> -99 übertragung
>
> \2-grams:
> 0 <s> ihre 0
> 0 aussage kann 0
> 0 ihre aussage 0
> 0 kann nicht 0
> 0 nicht verarbeitet 0
> 0 sagen </s>
> 0 verarbeitet sagen 0
>
> \3-grams:
> 0 ihre aussage kann
> 0 <s> ihre aussage
> 0 aussage kann nicht
> 0 kann nicht verarbeitet
> 0 verarbeitet sagen </s>
> 0 nicht verarbeitet sagen
>
> \end\
>
>
> None of the words in either training sentence belongs to any class.
>
> Also, I found that removing the last word from the second training
> sentence fixes the problem. Thus, for the following sentence:
>
> <s> ihre aussage kann nicht </s>
>
> the corresponding LM has correctly discounted probabilities (also
> around -10). Replacing 'werden' with any other word (I tried 'sagen',
> 'abgeben' and 'beer') causes the same problem again.
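>
> (To inspect the per-word log probs directly, one can run something
> like the following; the probe file is just a sketch:
>
> echo "übertragung" > probe.txt
> ngram -lm ACTION_REJECT_004.expanded.lm -order 1 -ppl probe.txt -debug 2
>
> With -debug 2, ngram prints the log probability assigned to each
> word.)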
>
> Is this a bug, or am I doing something wrong?
> I would appreciate any advice. I can also provide all the necessary
> data and scripts if needed.
>
> Sincerely yours,
> Dmytro Prylipko.
>
>
> _______________________________________________
> SRILM-User site list
> SRILM-User at speech.sri.com
> http://www.speech.sri.com/mailman/listinfo/srilm-user
>
>
>
>
>