Limited vocabulary causing "no-singletons" problem

Mats Svenson svmats at yahoo.com
Sat Jul 7 09:06:51 PDT 2007


Hi SRILM users,
 I have the following problem. I want to train an LM
for a low-resource speech recognizer. Since the
recognizer can only handle vocabularies of a limited
size (N), I first have to restrict my vocabulary to
the N most frequently occurring words in the training
text. However, since all of these words occur more
than once in the training corpus, this seems to
prevent me from using the discounting schemes that
rely on singleton counts.
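
For concreteness, this is roughly how I select the
vocabulary (file names and N are just placeholders):

    # count unigrams over the full training text
    ngram-count -text train.txt -order 1 -write unigram.counts
    # keep the N most frequent words as the recognizer vocabulary
    sort -k2,2 -n -r unigram.counts | head -n N | cut -f1 > vocab.txt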

For GT discounting, ngram-count prints a warning
about the lack of singletons in the training data; for
KN no warning is printed, but I suspect KN discounting
is affected by the missing singletons as well.
Ngram-count also has an option "-knn knfile" to
compute the smoothing parameters over the unlimited
vocabulary in advance, but I suspect this does not
entirely solve the problem. Is that right?
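
If I understand the option correctly, the intended use
would be something like the two-pass setup below (a
sketch for a trigram model; I am not sure about the
exact read/write semantics of the parameter files):

    # pass 1: estimate KN discounts over the full, unlimited vocabulary
    ngram-count -text train.txt -order 3 \
        -kndiscount1 -kn1 kn1.params \
        -kndiscount2 -kn2 kn2.params \
        -kndiscount3 -kn3 kn3.params
    # pass 2: reuse those discounts while restricting the vocabulary
    ngram-count -text train.txt -order 3 -vocab vocab.txt \
        -kndiscount1 -kn1 kn1.params \
        -kndiscount2 -kn2 kn2.params \
        -kndiscount3 -kn3 kn3.params \
        -lm limited.lm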

 Is there a way to work around this problem using
SRILM, or do I have to fall back on another (generally
inferior) discounting scheme such as Witten-Bell (at
least for counts of order 1)?
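
By that I mean something like the following, mixing
the per-order discounting options (again only a
sketch):

    # Witten-Bell for unigrams, KN for the higher orders
    ngram-count -text train.txt -order 3 -vocab vocab.txt \
        -wbdiscount1 -kndiscount2 -kndiscount3 -lm limited.lm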

Thanks for any help,
 Mats


       