FW: A simple question about SRILM

Roy Bar Haim barhaim at cs.technion.ac.il
Mon May 17 13:05:31 PDT 2004


Hi Andreas,

Thanks for your super-fast reply!

I tried it like you suggested:
ngram-count -order 3 -gt1max 0 -gt1min 1 -gt2max 0 -gt2min 1 \
  -gt3max 0 -gt3min 1 -text corpus.tags -lm corpus.tags.lm2 -debug 1

Many of the backoff weights indeed became 99 (which is good), but many
remained non-zero (although small: -6, -7, -8, ...).

Is there a way to make them all 99?
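The effect these options aim for can be sketched in Python: with maximum-likelihood estimates and every observed n-gram kept (that is what -gtNmax 0 plus -gtNmin 1 requests), the explicit entries for each context sum to 1, leaving zero probability mass for the backoff distribution. The toy tag data below is hypothetical, for illustration only:

```python
from collections import Counter, defaultdict

# Toy corpus of tag sequences (made-up data, not the real corpus.tags).
sentences = [["AT", "NN", "VB"], ["AT", "NN", "NN"], ["NN", "VB", "AT"]]

bigrams = Counter()
unigrams = Counter()
for s in sentences:
    toks = ["<s>"] + s + ["</s>"]
    unigrams.update(toks[:-1])          # count each token as a context
    bigrams.update(zip(toks, toks[1:]))

# Maximum-likelihood bigram estimates: p(w2 | w1) = c(w1, w2) / c(w1).
# With no discounting and no minimum count, every observed bigram
# keeps its full ML probability.
prob = {(w1, w2): c / unigrams[w1] for (w1, w2), c in bigrams.items()}

# The backoff mass for a context is whatever the explicit entries leave over.
leftover = defaultdict(lambda: 1.0)
for (w1, _), p in prob.items():
    leftover[w1] -= p

# Every context's leftover mass is (numerically) zero, so the backoff
# weight has nothing to redistribute -- a pure ML model with no backoff.
assert all(abs(m) < 1e-12 for m in leftover.values())
print("max leftover mass:", max(abs(m) for m in leftover.values()))
```

This is why any n-gram that gets pruned or discounted (for example by a minimum-count threshold) reintroduces leftover mass, and with it a backoff weight.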

The debug messages I got are listed below.

Thanks a lot,
Roy.
-------------------------------------------------------------------------------
corpus.tags: line 1892: 1892 sentences, 48332 words, 0 OOVs
0 zeroprobs, logprob= 0 ppl= 1 ppl1= 1
Good-Turing discounting 1-grams
GT-count [0] = 0
GT-count [1] = 0
warning: no singleton counts
GT discounting disabled
Good-Turing discounting 2-grams
GT-count [0] = 0
GT-count [1] = 126
GT discounting disabled
Good-Turing discounting 3-grams
GT-count [0] = 0
GT-count [1] = 2142
GT discounting disabled
discarded 1 2-gram contexts containing pseudo-events
discarded 2 3-gram contexts containing pseudo-events
writing 41 1-grams
writing 800 2-grams
writing 5145 3-grams
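One way to see which entries still carry backoff weight is to scan the ARPA file directly. A small sketch, assuming the standard ARPA format that ngram-count writes; the nonzero_backoffs helper and the sample fragment are illustrative, not SRILM code or real output:

```python
def nonzero_backoffs(arpa_lines):
    """Yield (order, line) for ARPA n-gram entries with a nonzero backoff.

    Assumes the standard ARPA layout: inside each "\\N-grams:" section a
    line reads "logprob w1 ... wN [backoff]", the backoff field optional.
    """
    order = 0
    for raw in arpa_lines:
        line = raw.strip()
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])  # e.g. "\2-grams:" -> 2
            continue
        if not line or line.startswith("\\") or order == 0:
            continue
        fields = line.split()
        # order + 2 fields means a trailing backoff weight is present.
        if len(fields) == order + 2 and float(fields[-1]) != 0.0:
            yield order, line

# Tiny made-up ARPA fragment (not from the real corpus.tags.lm2).
sample = """\\data\\
ngram 1=2
ngram 2=1

\\1-grams:
-0.3 AT -0.05
-0.5 NN 0.0000

\\2-grams:
-0.2 AT NN

\\end\\""".splitlines()

remaining = list(nonzero_backoffs(sample))
print(remaining)  # only the "AT" unigram entry carries a nonzero backoff
```

To check a real model, pass an open file instead of the sample, e.g. `nonzero_backoffs(open("corpus.tags.lm2"))`.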

> -----Original Message-----
> From: Andreas Stolcke [mailto:stolcke at speech.sri.com] 
> Sent: Monday, May 17, 2004 7:38 PM
> To: Roy Bar Haim
> Cc: srilm-user at speech.sri.com
> Subject: Re: FW: A simple question about SRILM 
> 
> 
> 
> In message
> <001701c43c3c$65fc62c0$34284484 at cs.technion.ac.il> you wrote:
> > Hi,
> > 
> > I have the same problem. I want the LM to give maximum-likelihood 
> > estimates. That is, all the backoff weights should be zero.
> > 
> > I applied the solution below, but still I get backoff weights.
> > 
> > For example, when I build the lm like this:
> > ngram-count -order 3 -gt1max 0 -gt2max 0 -gt3max 0 -text corpus.tags
> > -lm corpus.tags.lm
> > 
> > I found that the once-occurring trigrams DO NOT APPEAR in the lm, so
> > probability mass is still discounted.
> 
> The default minimum occurrence count for trigrams is 2.  Set it to 1
> to include all trigrams:
> 
> -gt3min 1  etc.
> 
> That's why you still get backoff.
> 
> > 
> > When I turned on the debug messages, I saw many messages like:
> > warning: 0 backoff probability mass left for "AT SCLN" --
> > incrementing denominator
> > 
> > Does it mean that smoothing is enforced here?
> > 
> > Is there a way to get a pure maximum-likelihood language model,
> > without backoff weights at all, using ngram-count?
> 
> see above.
> 
> --Andreas 
> 
> 



