[SRILM User List] linear interpolation of different vocabulary language models

Marta Ruiz martaruizcostajussa at gmail.com
Fri Jan 18 05:21:36 PST 2013


The process is killed anyway... Are there any alternatives?

best regards,
Marta

On Thu, Jan 17, 2013 at 11:22 AM, Andreas Stolcke <stolcke at icsi.berkeley.edu> wrote:

>  On 1/16/2013 6:00 PM, Marta Ruiz wrote:
>
> Hi Andreas,
>
> regarding this issue, I got the error
>
>  class definition has too many fields
>
> That means you must have a very long line in your class definitions file.
> You should have one class membership definition per line.
> If a class has many members you write one per line, for example
>
> NN    cat
> NN    dog
> NN    ball
>
> etc.
>
> Andreas
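
[Illustration] A minimal sketch of producing the one-membership-per-line layout described above. It assumes a hypothetical input in which each line lists a class name followed by all of its single-word members, and it ignores any optional probability field that the classes format may carry. The function name split_class_lines is made up for this example.

```python
def split_class_lines(lines):
    """Turn 'CLASS w1 w2 w3' lines into one 'CLASS<TAB>w' line per member.

    Assumption: every field after the class name is a separate single-word
    member (not a multiword expansion) and there is no probability column.
    """
    out = []
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank lines and classes without members
        cls, members = fields[0], fields[1:]
        for word in members:
            out.append(f"{cls}\t{word}")
    return out


if __name__ == "__main__":
    # Hypothetical long definition line of the kind that could trigger
    # "class definition has too many fields".
    for row in split_class_lines(["NN cat dog ball"]):
        print(row)
```

If the original file does carry membership probabilities, they would have to be parsed and repeated on each output line, which this sketch does not attempt.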
>
>
> In fact, I wanted to expand a language model of PoS tags into words,
> and each PoS tag has many words associated with it.
>
>
> best regards,
> Marta
>
> On Wed, Jan 9, 2013 at 3:34 PM, Andreas Stolcke <stolcke at icsi.berkeley.edu> wrote:
>
>>  On 1/8/2013 6:07 PM, Marta Ruiz wrote:
>>
>> Thanks Andreas, two more questions
>>
>>>
>>> 1. Create a word-based version of each model.  For example, you can
>>> construct a POS-based LM and combine it with a class membership mapping (in
>>> classes-format, see man page) to get a word-level POS-based model.
>>> Similar with lemma-based LMs (the lemmas are effectively word classes).
>>>
>>>
>> What is the command to do this?
>>
>>
>>  1. You create the class-to-word mapping file (in the format described at
>> http://www.speech.sri.com/projects/srilm/manpages/classes-format.5.html)
>> to reflect either your POS-to-word or lemma-to-word mapping.
>> 2. Process the training data to replace the words with POS or lemmas, as
>> appropriate.
>> 3. Train the ngram portion of the LM by running ngram-count on the
>> training data represented as a sequence of POS tags / lemmas (from step 2).
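
[Illustration] Steps 1 and 2 above could be prepared with a sketch like the following. It assumes the training data is already tagged and stored as whitespace-separated word/TAG tokens, which is a hypothetical input format and not something stated in this thread. The function name split_tagged_corpus is also made up. The output is a list of tag sequences (to be fed to ngram-count in step 3) and a list of class membership lines, one member per line, without membership probabilities.

```python
def split_tagged_corpus(tagged_sentences):
    """Derive tag-only sentences and class membership lines.

    Assumption: each token looks like 'word/TAG'; the last '/' separates
    the word from its tag.
    """
    tag_sentences = []
    members = {}  # tag -> set of words seen with that tag
    for sentence in tagged_sentences:
        tags = []
        for token in sentence.split():
            word, sep, tag = token.rpartition("/")
            if not sep or not word or not tag:
                raise ValueError(f"token is not in word/TAG form: {token!r}")
            tags.append(tag)
            members.setdefault(tag, set()).add(word)
        tag_sentences.append(" ".join(tags))
    class_lines = [
        f"{tag}\t{word}"
        for tag in sorted(members)
        for word in sorted(members[tag])
    ]
    return tag_sentences, class_lines


if __name__ == "__main__":
    corpus = ["the/DT cat/NN sleeps/VB", "a/DT dog/NN"]
    tag_text, classes = split_tagged_corpus(corpus)
    print("\n".join(tag_text))   # tag sequences for n-gram training
    print("\n".join(classes))    # class-to-word mapping, one member per line
```

The same idea applies to lemmas: replace the tag with the lemma, so each lemma acts as a class whose members are its surface forms.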
>>
>>
>>
>>
>>
>>> 2. Then interpolate the models using
>>>
>>>     ngram -bayes 0 -lm LM1 -mix-lm LM2 -mix-lm2 LM3 .... -lambda ... -mix-lambda2 ... -classes CLASSES
>>>
>>> where CLASSES is a classes-format(5) file defining the union of all the
>>> word classes used in the various component models.
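
[Illustration] For readers unfamiliar with linear interpolation, the quantity being computed is a weighted sum of the component models' probabilities for the same word in the same context. The sketch below shows only that arithmetic, with made-up numbers; it is not how ngram is implemented, and the function name interpolate is hypothetical.

```python
def interpolate(probs, lambdas):
    """Linearly interpolate per-model probabilities for one word.

    probs[i]   : probability model i assigns to the word in its context
    lambdas[i] : weight of model i; the weights must sum to 1
    """
    if len(probs) != len(lambdas):
        raise ValueError("need exactly one weight per model")
    if abs(sum(lambdas) - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return sum(lam * p for lam, p in zip(lambdas, probs))


if __name__ == "__main__":
    # Made-up probabilities from a word, a POS-based and a lemma-based model.
    print(interpolate([0.2, 0.05, 0.1], [0.5, 0.3, 0.2]))
```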
>>>
>>>
>> To find the lambdas, I can use compute-best-mix, can't I?
>>
>>  Exactly.
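
[Illustration] Interpolation weights of this kind are commonly estimated with expectation-maximization (EM) on held-out data. The sketch below is a generic EM loop over per-word probabilities, assuming each model has already scored the same held-out word sequence; it is not the source of compute-best-mix, and the function name and iteration count are made up for the example.

```python
def estimate_lambdas(prob_streams, iterations=50):
    """Estimate mixture weights by EM.

    prob_streams[i][t] is the (positive) probability that model i assigns
    to the t-th held-out word. All streams must have the same length.
    """
    n_models = len(prob_streams)
    n_words = len(prob_streams[0])
    if any(len(stream) != n_words for stream in prob_streams):
        raise ValueError("all models must score the same number of words")
    lambdas = [1.0 / n_models] * n_models  # start from uniform weights
    for _ in range(iterations):
        totals = [0.0] * n_models
        for t in range(n_words):
            # E-step: posterior share of each model for this word
            weighted = [lambdas[i] * prob_streams[i][t] for i in range(n_models)]
            norm = sum(weighted)
            for i in range(n_models):
                totals[i] += weighted[i] / norm
        # M-step: new weight is the average posterior share
        lambdas = [total / n_words for total in totals]
    return lambdas


if __name__ == "__main__":
    # Made-up per-word probabilities from two models on three words.
    print(estimate_lambdas([[0.5, 0.4, 0.3], [0.1, 0.1, 0.1]]))
```

A model that consistently assigns higher probability to the held-out words ends up with the larger weight.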
>>
>> Andreas
>>
>>
>
>
> --
> Marta Ruiz Costa-jussà
> martaruizcostajussa at gmail.com
> http://gps-tsc.upc.es/veu/personal/mruiz/mruiz.php3
>
>
>


-- 
Marta Ruiz Costa-jussà
martaruizcostajussa at gmail.com
http://gps-tsc.upc.es/veu/personal/mruiz/mruiz.php3