Class-based LM using the SRILM toolkit?

Andreas Stolcke stolcke at speech.sri.com
Mon Aug 6 10:55:25 PDT 2007


Madhav Krishna wrote:
> Dear Dr. Stolcke,
>
> Thank you for your email. However, we require a little more help. We
> have completed our experiments but have obtained surprising results.
>
> We trained and tested a class-based language model as per your
> instructions. We trained it on 5 training sets drawn from the same
> corpus, ranging in size from 300,000 to 1,500,000 words in steps of
> 300,000 words. The testing data was held constant at 400,000
> sentences. When testing the 5 LMs obtained from these training sets,
> we observed that the resulting perplexity values increased with the
> size of the training data, which is contrary to common findings. In
> fact, the perplexity values obtained were 710, 890, 1150, 1200, 1280.
>
> Could these values have occurred due to my not specifying a vocabulary
> explicitly while training the LMs? I believe that the toolkit adds all
> the words in the training data to the vocabulary by default. But then,
> how does it treat OOVs in the testing set? Also, how does the choice
> of vocabulary affect perplexity?
>   
Indeed, you cannot compare perplexities unless the LM vocabulary is 
constant across models.
That's because a larger vocabulary leads to higher inherent uncertainty 
about the next word.
OOVs and words with zero probability are excluded from the perplexity 
computation, so by fixing the vocabulary you also fix the set of 
excluded words, which again makes the comparison valid.

So, extract your vocabulary from the smallest or the largest of your 
training sets, and then train all models with -vocab VOCAB.
To handle unseen words properly in the class-based LM you might want to 
put them all in a special class
(which you have to construct separately from the ngram-class output and 
add to the class definition file).
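
A minimal sketch of the fixed-vocabulary setup (the file names below are 
made up, and smoothing options are omitted):

   # extract a single vocabulary, e.g., from the smallest training set
   ngram-count -text train-300k.txt -write-vocab fixed.vocab

   # train every model with that same vocabulary
   ngram-count -text train-300k.txt  -vocab fixed.vocab -order 3 -lm lm-300k.bo
   ngram-count -text train-1500k.txt -vocab fixed.vocab -order 3 -lm lm-1500k.bo

   # compute perplexities against the same vocabulary
   ngram -lm lm-300k.bo  -vocab fixed.vocab -ppl test.txt
   ngram -lm lm-1500k.bo -vocab fixed.vocab -ppl test.txt

For the class-based models you would additionally pass the class 
definitions to ngram with -classes at evaluation time.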

Andreas

> I would appreciate your help.
>
> Sincerely,
> Madhav Krishna
>
> On 5/30/07, Andreas Stolcke <stolcke at speech.sri.com> wrote:
>   
>>> Dear Dr. Stolcke,
>>>
>>> Thank you once again for your invaluable help.
>>>
>>> I have now developed two LMs using your toolkit - a trigram word-based model
>>> and a class-based model (static models). I now want to interpolate them and
>>> then apply some form of smoothing on the resultant LM. The ngram program in
>>> the toolkit has a -mix-lm option which allows linear interpolation; the
>>> manpages for that option mention:
>>>
>>> "*NOTE: *Unless *-bayes *(see below) is specified, *-mix-lm *triggers a
>>> static interpolation of the models in memory. In most cases a more
>>> efficient, dynamic interpolation is sufficient, requested by *-bayes
>>> 0*.**Also, mixing models of different type (
>>> e.g., word-based and class-based) will *only *work correctly with dynamic
>>> interpolation."
>>>
>>> What is dynamic interpolation? Is it applicable in my case? Can
>>>       
>> Dynamic interpolation means that the probabilities of the interpolated model
>> are computed on-the-fly, at test time.
>> Static interpolation, by contrast, means that a single model is created
>> ahead of testing, containing the interpolated probabilities in the
>> usual backoff format.  This is only possible for models of the same type,
>> as explained in the note above.
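>>
>> For instance (made-up file names, just a sketch), a static mix of two
>> word-based trigram models can be written out as a single backoff model:
>>
>>     ngram -lm wordA.3bo -mix-lm wordB.3bo -lambda 0.5 -write-lm mixed.3bo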
>>
>>     
>>> mixing/interpolation of these models be performed only with the -dynamic
>>> option? In that case, how?
>>>       
>> The -dynamic option has nothing to do with dynamic interpolation of the
>> kind we are discussing here.
>> Dynamic interpolation is enabled by the -bayes option.
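>>
>> For your word + class mixture something along these lines should work
>> (file names are made up; the class LM is given as the main -lm so that
>> -classes applies to it, and -bayes 0 requests the dynamic interpolation):
>>
>>     ngram -lm class.lm -classes classes.defs -mix-lm word.3bo \
>>         -bayes 0 -ppl test.txt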
>>
>>     
>>> Also, what is the -bayes interpolation method about? The manpages say for
>>> the -bayes option:
>>> "Interpolate the second and the main model using posterior probabilities for
>>> local N-gram-contexts of length *length*."
>>> What are you referring to by "N-gram contexts"? Are only the posterior
>>> probabilities interpolated here? If possible, please provide me with a link
>>> to a reference text etc. where I can learn more about this.
>>>       
>> For an explanation of Bayesian interpolation please consult the technical
>> report cited at the bottom of the ngram(1) man page.  You can get it at
>> http://www.speech.sri.com/cgi-bin/run-distill?papers/lm95-report.ps.gz
>> then check Section 2.3.
>>
>> Andreas
>>
>>
>>     
>
>
>   




