Limited vocabulary causing "no-singletons" problem

Andreas Stolcke stolcke at speech.sri.com
Sat Jul 7 09:48:35 PDT 2007


Use the make-big-lm script for training your LM.
(Despite the name, it works for small LMs as well.)

It will compute the GT or KN count-of-count statistics over
the unlimited vocabulary, and then apply your vocabulary
restriction when building the LM.
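
For example, assuming a trigram LM and a 20,000-word
vocabulary (all file names, the order, and the cutoff below
are placeholders to adapt to your setup), you could first
build the vocabulary and the unrestricted counts:

    # take the 20,000 most frequent words as the vocabulary
    ngram-count -order 1 -text train.txt -write1 - \
        | sort -k2,2 -n -r | head -n 20000 | cut -f1 > vocab.txt

    # N-gram counts over the full, unlimited vocabulary
    ngram-count -order 3 -text train.txt -write counts.gz

and then have make-big-lm estimate the discounts from the
full counts before applying the vocabulary restriction:

    make-big-lm -name biglm -order 3 -read counts.gz \
        -kndiscount -interpolate \
        -vocab vocab.txt -lm lm.gz

This is essentially the two-step procedure you were hinting
at with the -knn options: the smoothing parameters are
computed from the unrestricted counts first, and only then
is the -vocab restriction applied in the final estimation
pass.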

--Andreas

In message <499960.43881.qm at web31612.mail.mud.yahoo.com> you wrote:
> Hi SRILM users,
>  I have the following problem. I want to train an LM
> for a low-resource speech recognizer. Since the
> recognizer can only handle vocabularies of limited
> size (N), I first have to restrict my vocabulary to
> the N most frequently occurring words in the training
> text. However, since all such words occur more than
> once in the training corpus, this seems to prevent me
> from using the discounting schemes that rely on
> singleton counts.
> 
> For GT discounting, ngram-count warns about the lack
> of singletons in the training data; for KN no warning
> is printed, but I suspect KN discounting is affected
> by the missing singletons as well. Ngram-count also
> has the "-knn knfile" options for computing the
> smoothing parameters over an unlimited vocabulary in
> advance, but I suspect this does not entirely solve
> the problem. Is that true?
> 
>  Is there a way to work around this problem in SRILM,
> or do I have to fall back on another (generally
> inferior) discounting scheme such as Witten-Bell (at
> least for counts of order 1)?
> 
> Thanks for your help,
>  Mats