[SRILM User List] Question about select-vocab
Andreas Stolcke
stolcke at icsi.berkeley.edu
Wed Sep 5 13:36:29 PDT 2012
On 9/5/2012 1:05 PM, Anand Venkataraman wrote:
> I realized I was off the list and just rejoined (thanks Andreas).
>
> Meng - In response to your questions about select-vocab:
>
> 1. Yes, you're right about the PPL. The program trains separate
> unigram LMs for the given corpora (A & B) and the diagnostic
> output prints the PPL of the held-out set according to the _best_
> word-level mixture of A.1bo and B.1bo.
> 2. Hard to say how big the held-out set ought to be for given A and B
> sizes. My only suggestion is to ensure that the held-out set
> contains a representative sample of words that you expect to see
> in the domain. If in doubt, you can always extract the domain
> vocabulary and ensure that the held-out set covers the top N% (by
> frequency) of the domain words (for some suitable N).
>
> Hope this helps.
>
> &
>
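For what it's worth, the coverage check in point 2 can be sketched in a few lines of Python. This is only an illustration and not part of SRILM: the function names are made up, the inputs are assumed to be already-tokenized word lists, and "top N%" is read here as the top N% of distinct word types ranked by frequency (reading it as N% of the token mass would be an equally reasonable choice).

```python
from collections import Counter


def top_fraction_vocab(domain_tokens, fraction):
    """Return the most frequent `fraction` of distinct word types (hypothetical helper)."""
    counts = Counter(domain_tokens)
    # Keep at least one type so that a tiny vocabulary still yields a check.
    keep = max(1, round(len(counts) * fraction))
    return {word for word, _ in counts.most_common(keep)}


def heldout_coverage(domain_tokens, heldout_tokens, fraction=0.2):
    """Fraction of the top domain word types that occur in the held-out set."""
    top = top_fraction_vocab(domain_tokens, fraction)
    if not top:
        # Empty domain sample: nothing to cover.
        return 0.0
    seen = set(heldout_tokens)
    return len(top & seen) / len(top)


if __name__ == "__main__":
    domain = "the cat sat on the mat the cat ran".split()
    heldout = "the dog sat".split()
    print(heldout_coverage(domain, heldout, fraction=0.5))
```

If the reported coverage is low, the held-out set is probably missing words that matter for the domain and should be extended.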
Thanks Anand. Good to have you back on the list.
Meng: in case this wasn't clear, "PPL" is short for "perplexity".
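For reference, here is the standard definition applied to the mixture Anand describes. The single interpolation weight \lambda is an assumption made for illustration; the thread does not say how select-vocab parameterizes or estimates the mixture. N is the number of words scored in the held-out set.

```latex
p_{\lambda}(w) = \lambda \, p_{A}(w) + (1 - \lambda) \, p_{B}(w)

\mathrm{PPL} = \exp\!\left( -\frac{1}{N} \sum_{i=1}^{N} \ln p_{\lambda}(w_i) \right)
```

Lower perplexity means the mixture predicts the held-out words better, so the "best" mixture is the one whose weights minimize this quantity.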
Andreas