[SRILM User List] Perplexity of ngram-class during inducing

Andreas Stolcke stolcke at speech.sri.com
Mon Apr 12 14:38:39 PDT 2010


> 
> Dear Dr. Andreas
> 
> I have a question regarding the perplexity reported by ngram-class.
> 
> The command I used was: ngram-class -debug 2 -text TEXT -vocab VOCAB
> -numclasses NUM -classes OUTPUT
> 
> The output contains a perplexity and a PPL1 value. What does this
> perplexity stand for in class induction? It seems to be calculated
> during the class clustering process (merging), but what parameters does
> it use (e.g. -text and -lm)?
> 
> The manual says "...minimize perplexity of a class-based
> N-gram model given the provided word N-gram count". But to my
> understanding, a few steps are needed to use the class-based N-gram
> model:

The manual page is referring to the likelihood of the TRAINING corpus
(maximizing likelihood = minimizing perplexity), not of some test corpus.
To compute the test perplexity you do indeed have to go through the steps
you list.

Andreas
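
The equivalence "maximizing likelihood = minimizing perplexity" follows
because perplexity is a monotone decreasing function of the total
log-likelihood once the token count is fixed. A minimal Python sketch of that
relation, assuming base-10 log probabilities; the function name and the two
log-probability totals are made up for illustration and are not SRILM output:

```python
def perplexity(total_logprob10, num_tokens):
    """Perplexity from a total log10 probability over num_tokens scored tokens."""
    return 10.0 ** (-total_logprob10 / num_tokens)

# Two hypothetical class assignments scored on the same 100 training tokens.
# The one with the higher (less negative) log-likelihood has the lower perplexity.
ppl_a = perplexity(-250.0, 100)
ppl_b = perplexity(-230.0, 100)
```

As far as I understand SRILM's convention, the two reported figures differ
only in the denominator: ppl counts end-of-sentence tokens among the scored
tokens, while ppl1 leaves them out.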

> 
> (a) use ngram-class to induce classes
> (b) use replace-words-with-classes to replace words with classes in both
> the TEXT and VOCAB
> (c) follow the same method used to estimate a word-based n-gram LM, in
> order to get the class-based LM, which gives us P(C_i | C_i-2 C_i-1 ...)
> (d) use this LM to calculate the perplexity: ngram -ppl TEST_SET -lm LM
> -class CLASS_DEFINITION, which gives us P( wi | ci )
> 
> Does the perplexity in ngram-class correlate with the perplexity in step
> (d)? Or where could I find a more detailed definition of it?
> 
> Thanks for your help in advance.
> 
> Best Regards
> 
> Tzu-Chiang
> 
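
To illustrate the distinction drawn in the reply, here is a toy Python sketch
of a class bigram model, P(w_i | w_i-1) = P(c_i | c_i-1) * P(w_i | c_i),
estimated from a training corpus and then scored on that same corpus. This is
the training-set quantity a clustering procedure would try to minimize; it is
not SRILM's implementation. The bigram order, the unsmoothed estimates, the
corpus, and the class maps are all assumptions chosen for illustration.

```python
import math
from collections import Counter

def class_bigram_perplexity(train, test, word2class):
    """Perplexity of `test` under an unsmoothed class bigram model
    estimated from `train`. Toy code: unseen events raise an error."""
    word_count = Counter()      # count of each word
    class_count = Counter()     # count of each class (summed over its words)
    class_bigram = Counter()    # count of (previous class, class) pairs
    history_count = Counter()   # count of each class used as a history
    for sent in train:
        classes = ["<s>"] + [word2class[w] for w in sent] + ["</s>"]
        for w in sent:
            word_count[w] += 1
            class_count[word2class[w]] += 1
        for prev, cur in zip(classes, classes[1:]):
            class_bigram[(prev, cur)] += 1
            history_count[prev] += 1

    logprob = 0.0
    tokens = 0
    for sent in test:
        classes = ["<s>"] + [word2class[w] for w in sent] + ["</s>"]
        words = list(sent) + ["</s>"]
        for prev, cur, w in zip(classes, classes[1:], words):
            p = class_bigram[(prev, cur)] / history_count[prev]  # P(c_i | c_i-1)
            if w != "</s>":
                p *= word_count[w] / class_count[cur]            # P(w_i | c_i)
            logprob += math.log10(p)
            tokens += 1  # end-of-sentence counts as a scored token here
    return 10.0 ** (-logprob / tokens)

train = [["the", "cat", "sat"], ["the", "dog", "sat"],
         ["a", "cat", "ran"], ["a", "dog", "ran"]]
# Hypothetical class maps: one groups words by role, one groups them arbitrarily.
good = {"the": "DET", "a": "DET", "cat": "N", "dog": "N", "sat": "V", "ran": "V"}
bad = {"the": "X", "cat": "X", "sat": "X", "a": "Y", "dog": "Y", "ran": "Y"}

ppl_good = class_bigram_perplexity(train, train, good)
ppl_bad = class_bigram_perplexity(train, train, bad)
```

Passing a held-out corpus as `test` instead of `train` gives the analogue of
the test perplexity in step (d); a real model would need smoothing for that,
which this sketch omits.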


