Limited vocabulary causing "no-singletons" problem
Andreas Stolcke
stolcke at speech.sri.com
Sat Jul 7 09:48:35 PDT 2007
Use the make-big-lm script for training your LM.
(Despite the name, it works for small LMs as well.)
It will compute the Good-Turing (GT) or Kneser-Ney (KN)
count-of-count statistics using the unlimited vocabulary, and
then apply your limited vocabulary when building the LM.
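
As a concrete sketch (the file names, the vocabulary size of
20000, and the trigram/Kneser-Ney settings below are illustrative
placeholders, not something prescribed by SRILM):

    # 1. Count n-grams over the full, unrestricted vocabulary.
    ngram-count -text corpus.txt -order 3 -sort -write corpus.counts.gz

    # 2. Derive the top-N vocabulary from the unigram counts
    #    (unigram lines in the counts file have exactly two fields).
    gunzip -c corpus.counts.gz | awk 'NF == 2' | sort -k2,2nr \
        | head -n 20000 | awk '{print $1}' > top20k.vocab

    # 3. Train the LM. make-big-lm estimates the discounting
    #    parameters from the full counts and applies the vocabulary
    #    restriction only when the LM itself is built.
    make-big-lm -read corpus.counts.gz -name biglm -order 3 \
        -kndiscount -interpolate -vocab top20k.vocab -lm limited.lm.gz

This matters because the discount estimates depend on the
counts-of-counts: for example, the basic KN discount is
D = n1 / (n1 + 2*n2), which degenerates when n1 (the number of
singletons) is zero, as it is after a frequency-based vocabulary
cutoff.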
--Andreas
In message <499960.43881.qm at web31612.mail.mud.yahoo.com> you wrote:
> Hi SRILM users,
> I have the following problem. I want to train an LM
> for a low-resource speech recognizer. Since the
> recognizer can only handle vocabularies of limited
> size (N), I must first restrict my vocabulary to the
> N most frequently occurring words in the training
> text. However, since all such words occur more than
> once in the training corpus, this seems to prevent me
> from using the discounting schemes that rely on
> singleton counts.
>
> For GT discounting, ngram-count prints a warning
> about the lack of singletons in the training data; for
> KN no warning is printed, but I suspect that KN
> discounting is affected by the missing singletons as
> well. ngram-count also has an option "-knn knfile" to
> compute the smoothing parameters in advance using an
> unlimited vocabulary, but I guess this does not
> entirely solve the problem... Is that true?
>
> Is there a way to get around this problem in SRILM,
> or do I have to use another (generally inferior)
> discounting scheme such as Witten-Bell (at least for
> counts of order 1)?
> Thanks for the help,
> Mats