[SRILM User List] Question about select-vocab

Anand Venkataraman venkataraman.anand at gmail.com
Wed Sep 5 13:05:04 PDT 2012


I realized I was off the list and just rejoined (thanks Andreas).

Meng - In response to your questions about select-vocab:

   1. Yes, you're right about the PPL. The program trains separate unigram
   LMs for the given corpora (A & B), and the diagnostic output prints the
   perplexity (PPL) of the held-out set under the _best_ word-level mixture
   of A.1bo and B.1bo.
   2. It is hard to say how big the held-out set ought to be for given
   sizes of A and B. My only suggestion is to ensure that the held-out set
   contains a representative sample of the words that you expect to see in
   the domain. If in doubt, you can always extract the domain vocabulary and
   check that the held-out set covers the top N% (by frequency) of the
   domain words, for some suitable N.
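
As a rough illustration of the coverage check in point 2, here is a minimal
Python sketch. It is not part of SRILM. The function names (top_domain_words,
heldout_coverage) are made up for this example, and it assumes that "top N%"
means the top N% of distinct word types ranked by frequency and that the text
is already tokenized on whitespace.

```python
import math
from collections import Counter


def top_domain_words(domain_tokens, n_percent):
    """Return the top n_percent of distinct domain words, ranked by frequency.

    Assumption: "top N%" is taken over word types, not over token mass.
    """
    counts = Counter(domain_tokens)
    k = math.ceil(len(counts) * n_percent / 100.0)
    return [word for word, _ in counts.most_common(k)]


def heldout_coverage(domain_tokens, heldout_tokens, n_percent):
    """Return (fraction of top domain words seen in held-out set, missing words)."""
    top = top_domain_words(domain_tokens, n_percent)
    heldout_vocab = set(heldout_tokens)
    missing = [word for word in top if word not in heldout_vocab]
    fraction = 1.0 - len(missing) / len(top) if top else 1.0
    return fraction, missing


# Toy data standing in for the domain corpus and the held-out set.
domain = "a a a a b b b c c d".split()
heldout = "a c d".split()
fraction, missing = heldout_coverage(domain, heldout, 50)
print(fraction, missing)
```

If the list of missing words is non-empty for the N you care about, the
held-out set probably needs more (or more representative) text.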

Hope this helps.

&