Help needed with SRILM

Andreas Stolcke stolcke at speech.sri.com
Thu Oct 18 10:29:41 PDT 2007


> 
> Hi Andreas,
> 
> First of all, thank you for the fast reply last time.
> I have read your answer to Roy Bar-Haim and tried to follow it. I found that
> there were duplicate parts in the training data, which I have erased,
> and I have tried to create the language model from a corpus 10 times larger,
> but it did not help. I have managed to
> get rid of the warning only by changing the -gt1(min/max) options.
> By doing this, I have discovered that the performance of the language model
> is greatly affected by the probability given to the <unk> token. I use
> ngram-count like this:
> 
> ngram-count -text corp.out -lm ngram-count_output/lm_2iter.lm -unk -order 3
> -gt1min 0 -gt1max 2
> 
> So, as far as I understand, there should be no occurrence of <unk> in the
> corpus. But <unk> gets a high probability - even higher than words that did
> appear once in the corpus. Only when I disable discounting do I get a low
> probability for <unk>.

Here is the problem: you are estimating an LM with <unk> from data that 
doesn't contain any instance of <unk>.  As a result, <unk> gets all the 
unigram probability mass that is left over after discounting the observed 
unigrams, and that can be substantial.
This is because all the discounted unigram probability mass is distributed
over the zeroton words (vocabulary words with zero observed counts), and in
this case <unk> is the only zeroton word.
(If there are no zeroton words, then the discounted mass is added evenly to
ALL the words.)
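
To make this concrete with purely illustrative numbers: if Good-Turing
discounting leaves the observed unigrams with a total probability of, say,
0.94, the remaining 0.06 goes entirely to <unk>, i.e. a log10 probability of
about -1.22.  A word seen only once in a million-word corpus ends up far
below that, so <unk> can easily outrank every singleton.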

Incidentally, when you try this with -gt1max 1 (the default) on the
Switchboard counts under $SRILM/test/tests/ngram-count-gt you get

-5.503182	<unk>

a very small probability.  But already with -gt1max 2 you get 

-2.558506       <unk>

which is indeed larger than the probabilities of many observed words.  But
that is not unexpected: after all, <unk> represents ALL unobserved words.
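
If you want to see the same effect on your own data, a quick comparison
(the output file names here are just illustrative) would be:

# estimate the same LM under the two -gt1max settings
ngram-count -text corp.out -unk -order 3 -gt1max 1 -lm lm_gt1max1.lm
ngram-count -text corp.out -unk -order 3 -gt1min 0 -gt1max 2 -lm lm_gt1max2.lm
# compare the <unk> unigram entries
grep '<unk>' lm_gt1max1.lm lm_gt1max2.lm

The first field of each matching line in the ARPA files is the log10
unigram probability assigned to <unk>.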

The proper remedy is to limit your LM vocabulary to a subset of the observed 
words, so that the words left out of it can give you a meaningful estimate 
for unobserved words.
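
For example (just a sketch; the file names and the count cutoff are
illustrative), you could keep only words seen at least twice, so that the
singletons are mapped to <unk> and provide real evidence for it:

# write unigram counts, keep words with count >= 2 as the vocabulary
ngram-count -text corp.out -order 1 -write corp.1cnt
awk '$2 >= 2 { print $1 }' corp.1cnt > vocab.txt
# re-estimate the LM with the restricted vocabulary
ngram-count -text corp.out -vocab vocab.txt -unk -order 3 \
    -lm ngram-count_output/lm_limited.lm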

> Is there an option to set a fixed probability for the
> <unk>?

No, there isn't.  But there is a trick to achieve a similar effect.
Since your data doesn't contain any <unk>, you can fake some.
Just make a count file that contains some fictitious occurrences for <unk>,
e.g.,

<unk>		100

and call this UNK.counts.
Then add those counts to your real data, e.g., 

ngram-count -text corp.out -read UNK.counts -lm ngram-count_output/lm_2iter.lm -unk -order 3 -gt1min 0 -gt1max 2

And of course you can play with the fake count value to achieve a result 
that is reasonable, or even optimal on some held-out data.
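
One way to do that (a sketch; heldout.txt and the LM file names are
illustrative) is to sweep the fake count and score held-out text with
ngram -ppl:

for c in 1 10 100 1000; do
    # fake $c occurrences of <unk>
    printf '<unk>\t%d\n' $c > UNK.counts
    ngram-count -text corp.out -read UNK.counts -unk -order 3 \
        -gt1min 0 -gt1max 2 -lm lm_unk$c.lm
    # score held-out text with the resulting LM
    ngram -lm lm_unk$c.lm -unk -order 3 -ppl heldout.txt
done

Then keep the count value that gives the lowest held-out perplexity.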

> 
> BTW: I changed LM.cc a bit, so that when I call ngram -ppl it acts as a
> probability server - it listens on a port, receives sequences of words, and
> returns their probabilities.
> Do you want me to send you the code for it, so it could be added as a
> feature?

Please do send the code.  I wouldn't want to modify the existing meaning
of -ppl, but a new option with this functionality is something that
several people have asked about.

Andreas

> 
> Regards,
> Elad Dinur
> 
> On 9/23/07, Andreas Stolcke <stolcke at speech.sri.com> wrote:
> >
> > Elad Dinur wrote:
> > > Hello Andreas And/Or Jing,
> > >
> > > I am a graduate student in the Hebrew University of Jerusalem, guided
> > > by Ari Rappoport of The Hebrew University.
> > > I am working on unsupervised segmentation of words, with emphasis on
> > > Semitic languages, developing on Modern Hebrew.
> > > I am using SRILM to generate a trigram language model, and finding the
> > > probability of a sentence with the model.
> > > I am using ngram-count with the default settings; as far as I
> > > understand, that means Good-Turing discounting with Katz backoff.
> > > I get the following warning :
> > >
> > > warning: discount coeff 1 is out of range: 1.79427e-17
> > >
> > > I wonder if you can direct me to a document which elaborates on this
> > warning.
> > > Thanks in advance,
> > > Elad Dinur.
> > >
> > You can find the answer to this and many other questions by going to
> >
> > http://www.speech.sri.com/projects/srilm/mail-archive/srilm-user/
> >
> > and searching for "discount coeff 1 is out of range".
> >
> > Andreas
> >
> >
> >
> >
> 
> 
> -- 
> what ?!



