[SRILM User List] Question about select-vocab

Andreas Stolcke stolcke at icsi.berkeley.edu
Wed Sep 5 13:36:29 PDT 2012


On 9/5/2012 1:05 PM, Anand Venkataraman wrote:
> I realized I was off the list and just rejoined (thanks Andreas).
>
> Meng - In response to your questions about select-vocab:
>
>  1. Yes, you're right about the PPL. The program trains separate
>     unigram LMs for the given corpora (A & B) and the diagnostic
>     output prints the PPL of the held-out set according to the _best_
>     word-level mixture of A.1bo and B.1bo.
>  2. It is hard to say how big the held-out set ought to be for given
>     A and B sizes. My only suggestion is to ensure that the held-out
>     set contains a representative sample of the words you expect to
>     see in the domain. If in doubt, you can always extract the domain
>     vocabulary and ensure that the held-out set covers the top N% (by
>     frequency) of the domain words, for some suitable N.
>
> Hope this helps.
>
> &
>
Thanks Anand.  Good to have you back on the list.

Meng:  in case this wasn't clear, "PPL" is short for "perplexity".
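
For illustration, here is a minimal Python sketch of what a perplexity
figure under a word-level mixture of two unigram models means. This is
not SRILM's code: the add-alpha smoothing, the EM loop for the mixture
weight, the toy corpora, and all function names are assumptions made
for the example only.

```python
import math
from collections import Counter

def unigram_lm(tokens, vocab, alpha=1.0):
    """Add-alpha smoothed unigram probabilities over a fixed vocabulary.
    (Hypothetical helper; the smoothing choice is an assumption.)"""
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values()) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def mixture_perplexity(heldout, p_a, p_b, lam):
    """Perplexity of the held-out tokens under lam*p_a + (1-lam)*p_b."""
    log_prob = sum(math.log(lam * p_a[w] + (1.0 - lam) * p_b[w])
                   for w in heldout)
    return math.exp(-log_prob / len(heldout))

def best_mixture_weight(heldout, p_a, p_b, iterations=50):
    """Estimate the weight of model A with EM, starting from 0.5."""
    lam = 0.5
    for _ in range(iterations):
        # E-step: posterior probability that each token came from model A.
        posteriors = [lam * p_a[w] / (lam * p_a[w] + (1.0 - lam) * p_b[w])
                      for w in heldout]
        # M-step: the new weight is the average posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam

# Toy data, invented for the example.
corpus_a = "the cat sat on the mat".split()
corpus_b = "stocks fell as the market closed".split()
heldout = "the cat sat on the market".split()
vocab = set(corpus_a) | set(corpus_b) | set(heldout)

p_a = unigram_lm(corpus_a, vocab)
p_b = unigram_lm(corpus_b, vocab)
lam = best_mixture_weight(heldout, p_a, p_b)
ppl_half = mixture_perplexity(heldout, p_a, p_b, 0.5)
ppl_best = mixture_perplexity(heldout, p_a, p_b, lam)
print(f"lambda={lam:.3f}  PPL(0.5)={ppl_half:.2f}  PPL(best)={ppl_best:.2f}")
```

Lower perplexity means the mixture predicts the held-out words better,
so the EM-estimated weight should never do worse than the 0.5 starting
point on the same held-out set.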

Andreas
