Some SRILM questions
stolcke at speech.sri.com
Fri Mar 14 11:41:10 PST 2003
In message <Pine.SGI.4.21.0303141511560.3407821-100000 at james.hut.fi> you wrote:
> I have a couple of question about the SRILM toolkit and I was hoping you
> would have time to answer my questions:
> 1) Is there any way to make a n-gram model without sentence start and end
> tags (<s>,</s>) ?
Yes. You can supply the counts yourself and make sure no N-grams
containing those symbols are included. Both symbols will still
appear in the unigrams, but if you declare </s> to be a "non-event"
(ngram-count -nonevents) then they will get 0 probability.
(I haven't tried this recently, so let me know if you run into problems.)
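A sketch of what that invocation might look like (untested; the file names here are placeholders, and the counts file is assumed to have been prepared without sentence-boundary N-grams):

```shell
# List the symbols to treat as non-events, one per line.
echo '</s>' > nonevents.txt

# Build the LM from hand-prepared counts rather than raw text,
# so no N-grams containing <s> or </s> are introduced.
ngram-count -order 3 \
    -read my.counts \
    -nonevents nonevents.txt \
    -lm nosent.lm
```

The key point is using -read with counts you control instead of -text, so the toolkit never inserts the boundary tags itself.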
> 2) I tried teaching a Kneser-Ney smoothed 5-gram model
> ( -kndiscount1 -kndiscount2 -kndiscount3 -kndiscount4 -kndiscount5 )
> and got the error
> warning: one of required count-of-counts is zero
> error in discount estimator for order 4
> I suppose this is a feature of K-N smoothing. Is there any way around this
> or have I done something stupid ?
KN (as well as GT) discounting requires count-of-counts statistics,
and to work well these need to come from "natural" data, in the sense that you
didn't delete, duplicate, or otherwise manipulate the raw corpus counts.
For example, you might be using a vocabulary that does not include all
the training words, and that would skew the count-of-counts statistics.
If there is nothing obvious that you did, try using the "make-big-lm"
script, which is a wrapper around ngram-count that avoids truncating
the vocabulary prior to estimating the discounting statistics.
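A possible invocation (a sketch, untested; counts.gz, the -name prefix, and the vocabulary file are placeholders):

```shell
# make-big-lm passes ngram-count options through, but estimates the
# discounting statistics from the full counts before any vocabulary
# restriction is applied, avoiding skewed count-of-counts.
make-big-lm -read counts.gz \
    -name /tmp/biglm \
    -order 5 \
    -kndiscount \
    -vocab my.vocab \
    -lm kn5.lm.gz
```

The -name argument is a prefix for the intermediate files the script writes; make sure it points somewhere with enough disk space for large count sets.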