[SRILM User List] Question about select-vocab
Anand Venkataraman
venkataraman.anand at gmail.com
Wed Sep 5 13:05:04 PDT 2012
I realized I was off the list and just rejoined (thanks Andreas).
Meng - In response to your questions about select-vocab:
1. Yes, you're right about the PPL (perplexity). The program trains a
separate unigram LM for each of the given corpora (A and B), and the
diagnostic output prints the PPL of the held-out set under the _best_
word-level mixture of A.1bo and B.1bo.
2. It's hard to say how big the held-out set ought to be for given A and B
sizes. My only suggestion is to ensure that the held-out set contains a
representative sample of the words you expect to see in the domain. If in
doubt, you can always extract the domain vocabulary and check that the
held-out set covers the top N% (by frequency) of the domain words, for
some suitable N.
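
To make point 1 concrete, here is a rough sketch of what "PPL under the best mixture of two unigram LMs" means. It is not the select-vocab code: the add-alpha smoothing, the single interpolation weight fitted by EM, and all function names are assumptions made for illustration only.

```python
import math
from collections import Counter

def unigram_probs(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary (assumed smoothing)."""
    counts = Counter(tokens)
    total = len(tokens) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def best_mixture(p_a, p_b, heldout, iters=50):
    """Fit lam in lam*p_a + (1-lam)*p_b to the held-out tokens with EM."""
    lam = 0.5
    for _ in range(iters):
        # E-step: posterior probability that each held-out token came from model A
        post = [lam * p_a[w] / (lam * p_a[w] + (1 - lam) * p_b[w]) for w in heldout]
        # M-step: the new weight is the average posterior
        lam = sum(post) / len(post)
    return lam

def perplexity(p_a, p_b, lam, heldout):
    """Perplexity of the held-out tokens under the interpolated unigram model."""
    log_prob = sum(math.log(lam * p_a[w] + (1 - lam) * p_b[w]) for w in heldout)
    return math.exp(-log_prob / len(heldout))

# Toy data, purely illustrative
corpus_a = "the cat sat on the mat".split()
corpus_b = "stocks fell as the market closed".split()
heldout = "the cat sat on the market".split()
vocab = set(corpus_a) | set(corpus_b) | set(heldout)

p_a = unigram_probs(corpus_a, vocab)
p_b = unigram_probs(corpus_b, vocab)
lam = best_mixture(p_a, p_b, heldout)
ppl_mix = perplexity(p_a, p_b, lam, heldout)
print("weight on A:", lam, "held-out PPL:", ppl_mix)
```

Since EM never lowers the held-out likelihood, the PPL at the fitted weight is no worse than at the 0.5 starting point.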
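
The coverage check in point 2 could be scripted along these lines. This is only a sketch: it reads "top N%" as the top N% of distinct word types ranked by frequency (one possible reading), and the function names are made up for the example.

```python
import math
from collections import Counter

def top_percent_words(domain_tokens, n_percent):
    """Return the top n_percent of distinct domain words, ranked by frequency."""
    counts = Counter(domain_tokens)
    ranked = [w for w, _ in counts.most_common()]
    k = math.ceil(len(ranked) * n_percent / 100)
    return ranked[:k]

def missing_from_heldout(domain_tokens, heldout_tokens, n_percent):
    """List the top domain words that never occur in the held-out set."""
    heldout_vocab = set(heldout_tokens)
    return [w for w in top_percent_words(domain_tokens, n_percent)
            if w not in heldout_vocab]

# Toy data, purely illustrative
domain = "a a a a b b b c c d".split()
heldout = "a c c d".split()
missing = missing_from_heldout(domain, heldout, 50)
print("top domain words missing from held-out set:", missing)
```

If the list comes back non-empty, the held-out set probably needs more (or more representative) text before its mixture weights can be trusted.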
Hope this helps.
&