singleton counts warning
Andreas Stolcke
stolcke at speech.sri.com
Mon Mar 15 16:24:53 PST 2004
In message <40556B21.8080706 at irisa.fr> you wrote:
> Hi!
> I use SRILM to build a language model on letters. I get a warning that
> I don't understand: "warning: no singleton counts
> GT discounting disabled"
> As a result, the computed model is wrong, since some back-off weights are
> positive (in log-probability)! Do you know what this warning means? I
> thought no counts on single letters had been computed, but they were, so I
> can't find an explanation!
GT (and also KN) discounting needs the number of words that appear only
once (singletons) in the training corpus. If that number is 0, the
discounting formulae for those methods cannot be applied. In a letter LM
the vocabulary is so small that every letter typically occurs many times,
so there are no singletons.
Please try using a different smoothing method, such as
Witten-Bell, for your letter LM, at least for the unigrams.
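
To see why the singleton count matters, here is a minimal Python sketch of
the textbook Good-Turing estimate r* = (r + 1) * n[r+1] / n[r], where n[r]
is the number of types seen exactly r times. The function names and the toy
corpus are made up for illustration, and this is not SRILM's actual code
(see Discount.cc for that); it only shows that the estimate for r = 1 is
undefined when n[1] is 0.

```python
from collections import Counter


def count_of_counts(tokens):
    """Map each frequency r to the number of distinct types seen exactly r times."""
    type_counts = Counter(tokens)
    return Counter(type_counts.values())


def good_turing_adjusted(r, n):
    """Textbook Good-Turing estimate r* = (r + 1) * n[r+1] / n[r].

    Returns None when n[r] is 0, because the formula would divide by zero.
    """
    if n.get(r, 0) == 0:
        return None
    return (r + 1) * n.get(r + 1, 0) / n[r]


# Toy letter corpus (hypothetical): every letter occurs at least 50 times,
# so no letter is a singleton.
letters = list("abracadabra" * 50)
n_large = count_of_counts(letters)
singletons_large = n_large.get(1, 0)
adjusted_large = good_turing_adjusted(1, n_large)

# A tiny corpus still has singletons ("c" and "d" occur once each).
n_small = count_of_counts(list("abracadabra"))
singletons_small = n_small.get(1, 0)
adjusted_small = good_turing_adjusted(1, n_small)

print("large corpus: singletons =", singletons_large, "r* for r=1:", adjusted_large)
print("small corpus: singletons =", singletons_small, "r* for r=1:", adjusted_small)
```

On the repeated corpus the singleton count is 0 and the estimate cannot be
computed, which is the situation the warning reports. Witten-Bell does not
use counts-of-counts, so it is not affected; in ngram-count it should be
selectable with -wbdiscount (or -wbdiscount1 for unigrams only), but please
check the ngram-count(1) man page for the exact options.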
>
> I have another question, about the computation of unigram
> log-probabilities. When I use the formula log[P(w)] = log[c(w)] -
> log[N], where N is the number of word TOKENS in the training corpus, I
> don't get exactly the value given by SRILM. Is there smoothing on
> unigrams? And if so, how is it done?
Of course there is smoothing. I don't have time to elaborate on
the different smoothing algorithms implemented in SRILM, but you can either
study the code in Discount.cc, or refer to the excellent survey paper
by Chen & Goodman (SEE ALSO section of the ngram-count(1) man page).
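
As a rough illustration of why a smoothed unigram differs from
log[c(w)] - log[N], the sketch below compares the unsmoothed estimate with
the textbook Witten-Bell form P(w) = c(w) / (N + T), where T is the number
of distinct types. This is an assumption chosen for illustration, not
necessarily what SRILM computes (for example, it ignores how the reserved
mass is redistributed); the function names and toy corpus are hypothetical.
Base-10 logs are used because ARPA-format LM files store log10 values.

```python
import math
from collections import Counter


def mle_logprob(counts, w):
    """Unsmoothed estimate: log10 c(w) - log10 N."""
    total = sum(counts.values())
    return math.log10(counts[w]) - math.log10(total)


def witten_bell_logprob(counts, w):
    """Textbook Witten-Bell unigram: log10 of c(w) / (N + T).

    T / (N + T) of the probability mass is held back for unseen events,
    so every seen word gets slightly less than its relative frequency.
    """
    total = sum(counts.values())
    types = len(counts)
    return math.log10(counts[w] / (total + types))


# Toy letter counts (hypothetical): a=5, b=2, r=2, c=1, d=1, so N=11, T=5.
counts = Counter("abracadabra")
mle_a = mle_logprob(counts, "a")
wb_a = witten_bell_logprob(counts, "a")

print("unsmoothed  log10 P(a) =", mle_a)
print("Witten-Bell log10 P(a) =", wb_a)
```

The smoothed value is always a little lower than the unsmoothed one for a
seen word, which is the kind of discrepancy described in the question.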
--Andreas