[SRILM User List] linear interpolation of different vocabulary language models

Tue Jan 8 12:15:57 PST 2013

On 1/8/2013 3:15 AM, Marta Ruiz wrote:
> Dear all,
>
> How can I interpolate language models built on the same text but with 
> different vocabularies? I mean, I have a text with words, lemmas and PoS;
> how can I interpolate the language models?
You cannot interpolate models that use different types of vocabularies. 
(You could interpolate models that are all word-based but that differ in 
the sets of words occurring in the component models. Words that do not 
occur in a given submodel would implicitly have probability zero in that 
submodel.)
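
To make the parenthetical concrete, here is a minimal Python sketch of linear interpolation over word-level models whose vocabularies differ. It is not SRILM code: the toy unigram tables, the weights, and the `interpolate` helper are all made up for illustration, and real n-gram models would also condition on history and handle unknown words. The only point shown is that a word missing from one submodel contributes zero from that submodel.

```python
def interpolate(word, models, weights):
    """Linear interpolation of unigram probabilities: sum_i lambda_i * p_i(word)."""
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    # A word absent from a submodel is treated as having probability 0 there.
    return sum(lam * model.get(word, 0.0) for lam, model in zip(weights, models))


# Hypothetical toy models (illustrative numbers only).
lm1 = {"the": 0.5, "cat": 0.3, "dog": 0.2}
lm2 = {"the": 0.6, "cat": 0.4}  # "dog" is not in this model's vocabulary

p_dog = interpolate("dog", [lm1, lm2], [0.5, 0.5])  # only lm1 contributes
p_the = interpolate("the", [lm1, lm2], [0.5, 0.5])  # both models contribute
```

With equal weights, "dog" keeps only half of its lm1 probability, because lm2 assigns it nothing.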

So what you need to do is:

1. Create a word-based version of each model.  For example, you can 
construct a POS-based LM and combine it with a class membership mapping 
(in classes-format, see the classes-format(5) man page) to get a 
word-level POS-based model.  The same applies to lemma-based LMs (the 
lemmas are effectively word classes).

2. Then interpolate the models using

     ngram -bayes 0 -lm LM1 -mix-lm LM2 -mix-lm2 LM3 .... -lambda ... -mix-lambda2 ... -classes CLASSES

where CLASSES is a classes-format(5) file defining the union of all the 
word classes used in the various component models.
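
As a rough illustration of the class membership mapping from step 1, the Python sketch below derives class definitions from (word, class) pairs. It assumes the one-expansion-per-line layout "CLASS p word" described in classes-format(5), with p estimated as count(word, CLASS) / count(CLASS). The `class_definitions` helper and the tiny tagged sample are hypothetical; check the man page before relying on the exact layout.

```python
from collections import Counter


def class_definitions(tagged_tokens):
    """Return lines 'CLASS p word', where p = count(word, CLASS) / count(CLASS)."""
    pair_counts = Counter(tagged_tokens)
    class_counts = Counter(cls for _, cls in tagged_tokens)
    lines = []
    # Sort by class name, then word, so the output is deterministic.
    for (word, cls), n in sorted(pair_counts.items(),
                                 key=lambda kv: (kv[0][1], kv[0][0])):
        lines.append(f"{cls} {n / class_counts[cls]:.6g} {word}")
    return lines


# Hypothetical POS-tagged sample: (word, tag) pairs.
tagged = [("the", "DT"), ("cat", "NN"), ("the", "DT"),
          ("dog", "NN"), ("a", "DT"), ("cat", "NN")]
lines = class_definitions(tagged)
```

The same helper could be run on (word, lemma) pairs. Since CLASSES has to be the union of all class definitions, the POS and lemma class names would need to be kept distinct, for instance with a prefix, before the lines are concatenated into one file.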

Andreas