FW: A simple question about SRILM

Andreas Stolcke stolcke at speech.sri.com
Tue May 18 20:03:39 PDT 2004


In message <002701c43c4a$4f810b00$34284484 at cs.technion.ac.il> you wrote:
> Hi Andreas,
> 
> Thanks for your super-fast reply!
> 
> I tried it like you suggested:
> ngram-count -order 3 -gt1max 0 -gt1min 1 -gt2max 0 -gt2min 1 -gt3max 0
> -gt3min 1 -text corpus.tags -lm corpus.tags.lm2 -debug 1
> 
> Many of the backoff weights indeed became -99 (which is good), but many
> remained non-zero (although small: -6, -7, -8, ...).
> 
> Is there a way to make them all -99?

This might not be necessary. 

If the left-over probability mass in some context is 0 (as it should
be when using ML estimates) AND the sum of the lower-order probabilities
for the N-grams NOT occurring in that context is also 0 (which can happen
since those are ML estimates, too), the backoff weight is 0/0, and due to
numerical inaccuracies this may turn out to be one of the small values you
observed.  (The code catches actual 0/0 divisions and generates -99 in
those cases.)
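
Concretely, the backoff weight for a context is (roughly, in the usual
Katz-style backoff scheme) the ratio of the two left-over masses:

    BOW(context) = (1 - sum_{w seen after context} p(w | context))
                   / (1 - sum_{w seen after context} p'(w | context'))

where p' is the lower-order estimate and context' is the context with its
leftmost word dropped.  With pure ML estimates the numerator is exactly 0,
and whenever the words seen after the context also soak up all of the
lower-order mass, the denominator is 0 as well, hence 0/0.
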
However, this is not a problem, because the lower-order log probability
that gets backed off to for any of the non-observed N-grams is itself
-infinity, so the particular value of the backoff weight that gets applied
doesn't matter for the outcome (-infinity plus any value is still
-infinity).

To verify that this is the case, feed some of those unobserved
N-grams to ngram -debug 2 -ppl and make sure the log probabilities are -infinity.
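
For example, with the model built above (corpus.tags.lm2) and a small file,
say test.tags, containing a sentence that includes one of the unobserved
trigrams (the file name is just a placeholder), something like

    ngram -lm corpus.tags.lm2 -order 3 -debug 2 -ppl test.tags

should show a log probability of -infinity for the unseen trigram in the
per-word output (and count it under "zeroprobs" in the summary line).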

--Andreas 


> 
> The debug messages I got are listed below.
> 
> Thanks a lot,
> Roy.
> -----------------------------------------------------------------------------------------------
> corpus.tags: line 1892: 1892 sentences, 48332 words, 0 OOVs
> 0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
> Good-Turing discounting 1-grams
> GT-count [0] = 0
> GT-count [1] = 0
> warning: no singleton counts
> GT discounting disabled
> Good-Turing discounting 2-grams
> GT-count [0] = 0
> GT-count [1] = 126
> GT discounting disabled
> Good-Turing discounting 3-grams
> GT-count [0] = 0
> GT-count [1] = 2142
> GT discounting disabled
> discarded 1 2-gram contexts containing pseudo-events
> discarded 2 3-gram contexts containing pseudo-events
> writing 41 1-grams
> writing 800 2-grams
> writing 5145 3-grams
> 
> > -----Original Message-----
> > From: Andreas Stolcke [mailto:stolcke at speech.sri.com] 
> > Sent: Monday, May 17, 2004 7:38 PM
> > To: Roy Bar Haim
> > Cc: srilm-user at speech.sri.com
> > Subject: Re: FW: A simple question about SRILM 
> > 
> > 
> > 
> > In message 
> > <001701c43c3c$65fc62c0$34284484 at cs.technion.ac.il> you wrote:
> > > Hi,
> > > 
> > > I have the same problem. I want the LM to give maximum-likelihood 
> > > estimates. That is, all the backoff weights should be zero.
> > > 
> > > I applied the solution below, but still I get backoff weights.
> > > 
> > > For example, when I build the lm like this:
> > > ngram-count -order 3 -gt1max 0 -gt2max 0 -gt3max 0 -text corpus.tags
> > > -lm corpus.tags.lm
> > > 
> > > I found that the once-occurring trigrams DO NOT APPEAR in the lm, so
> > > probability mass is still discounted.
> > 
> > the default minimum occurrence count for trigrams is 2.  set it to 1
> > to include all trigrams:
> > 
> > -gt3min 1 etc.
> > 
> > that's why you still get backoff.
> > 
> > > 
> > > When I turned on the debug messages, I saw many messages like:
> > > warning: 0 backoff probability mass left for "AT SCLN" --
> > > incrementing denominator
> > > 
> > > Does it mean that smoothing is enforced here?
> > > 
> > > Is there a way to get a pure maximum-likelihood language model, 
> > > without backoff weights at all, using ngram-count?
> > 
> > see above.
> > 
> > --Andreas 
> > 
> > 
> 



