singleton counts warning

Mon Mar 15 16:24:53 PST 2004

In message <40556B21.8080706 at irisa.fr> you wrote:
> Hi !
> I use SRILM to build a language model on letters. I get a warning that 
> I don't understand: "warning: no singleton counts
> GT discounting disabled"
> So the computed model is wrong, since some back-off weights are positive 
> (in log-probability)! Do you know what this warning means? I 
> thought no counts on single letters were computed, but they were, so I 
> can't find an explanation!

GT (and also KN) discounting needs the number of words (N-grams, in 
general) that appear only once (singletons) in the training corpus.  If 
that number is 0, the discounting formulae for those methods cannot be 
applied.
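
To illustrate why a zero singleton count is fatal, here is a minimal
Python sketch of Katz-style Good-Turing discounting.  It is not taken
from Discount.cc; the function names and the cutoff k are assumptions
made for illustration only.  The point is that n[1], the number of
singletons, appears in a denominator, so the discounts are undefined
when it is 0.

```python
from collections import Counter


def count_of_counts(counts):
    """Map r -> number of distinct items that occur exactly r times."""
    return Counter(counts.values())


def katz_gt_discounts(n, k=5):
    """Return {r: d_r} for 1 <= r <= k, or None if the formula is undefined.

    n is a mapping r -> n_r (count of counts).  k is a hypothetical
    cutoff above which counts would be left undiscounted.
    """
    if n.get(1, 0) == 0:
        return None  # no singletons: division by n_1 is impossible
    common = (k + 1) * n.get(k + 1, 0) / n[1]
    if common == 1.0:
        return None  # denominator (1 - common) would be zero
    discounts = {}
    for r in range(1, k + 1):
        if n.get(r, 0) == 0:
            return None  # n_r is also a denominator
        ratio = (r + 1) * n.get(r + 1, 0) / (r * n[r])
        discounts[r] = (ratio - common) / (1.0 - common)
    return discounts


# Every letter in this toy corpus occurs twice, so there are
# no singletons and this prints None.
print(katz_gt_discounts(count_of_counts(Counter("abab"))))
```

A letter-level corpus of any real size behaves like the toy example:
every letter occurs many times, so the unigram singleton count is 0.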

Please try applying a different smoothing method, such as 
Witten-Bell, to your letter LM, at least for the unigrams.

> 
> I've got another question, about the computation of the unigram 
> log-probability. When I use the formula: log[P(w)] = log[c(w)] - 
> log[N], where N is the number of word TOKENS in the training corpus, I 
> don't get exactly the value given by SRILM. Is there smoothing on 
> unigrams? And if so, how is it done?

Of course there is smoothing.  I don't have time to elaborate on
the different smoothing algorithms implemented in SRILM, but you can either
study the code in Discount.cc, or refer to the excellent survey paper 
by Chen & Goodman (see the SEE ALSO section of the ngram-count(1) man page).
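
As a rough illustration of why the unsmoothed formula does not match,
the Python sketch below compares the maximum-likelihood unigram
estimate with a textbook Witten-Bell estimate, c(w) / (N + T), where T
is the number of distinct word types.  This is an assumption-laden
sketch, not the code in Discount.cc; the function name is hypothetical,
and other details (such as how sentence-boundary tokens are counted)
may also contribute to the difference you see.

```python
import math
from collections import Counter


def unigram_logprobs(tokens):
    """Return (mle, wb): log10 unigram probabilities per word.

    mle uses c(w) / N.  wb uses the textbook Witten-Bell estimate
    c(w) / (N + T), which reserves T / (N + T) for unseen words.
    """
    counts = Counter(tokens)
    total = sum(counts.values())   # N: number of tokens
    types = len(counts)            # T: number of distinct types
    mle = {w: math.log10(c / total) for w, c in counts.items()}
    wb = {w: math.log10(c / (total + types)) for w, c in counts.items()}
    return mle, wb


mle, wb = unigram_logprobs(list("aab"))
for w in sorted(mle):
    print(w, mle[w], wb[w])
```

The smoothed log-probability of every seen word is lower than its
unsmoothed value, because some probability mass is held back.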

--Andreas