inquiry about SRI Toolkit
Anand Venkataraman
anand at speech.sri.com
Tue Jul 16 12:55:34 PDT 2002
Man page for compute-mixed-logprob:
compute-mixed-logprob computes the log probability of a given
corpus of text according to the best mixture of the given
component language models. The interpolation is done fairly. That
is, the given corpus is split into two sets (with alternate lines
belonging to different sets) and the mixture coefficients for
each set are those computed using EM on the other set. Up to six
language models may be specified on the command line using the
-lm flag. If the splitting of the corpus into two sets by
alternate line order is not the method desired, the user may
explicitly specify two sets on the command line using -sets set1 set2
instead of giving a single -text corpus option. The -lm-flags
option may be given to supply additional options passed on to
ngram during perplexity calculations, for instance, if the
language models are class language models and a class file needs
to be specified with -classes classfile. Language model ngram
orders may likewise be passed on to ngram using -lm-flags
'-order n'. All such options that are to be passed to ngram must
be quoted and passed to compute-mixed-logprob as a single option.
However, note that the supplied ngram options will be used for
all the language models specified.
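
The "fair" two-fold scheme described above can be sketched roughly as follows. This is only an illustration, not the script's actual implementation: the function names are hypothetical, and it assumes the per-word probabilities from each component model are already available as tuples (one entry per language model), with at least one positive entry per word.

```python
import math

def split_alternate(lines):
    """Split a corpus into two sets by alternating lines."""
    return lines[0::2], lines[1::2]

def em_mixture_weights(probs, iters=50, tol=1e-6):
    """Estimate mixture weights by EM.

    probs: list of tuples, one tuple per word, holding the probability
    each component model assigns to that word (assumed input format).
    """
    k = len(probs[0])
    weights = [1.0 / k] * k          # start from uniform weights
    for _ in range(iters):
        counts = [0.0] * k
        for p in probs:
            mix = sum(w * pi for w, pi in zip(weights, p))
            for i in range(k):
                # posterior share of component i for this word
                counts[i] += weights[i] * p[i] / mix
        new = [c / len(probs) for c in counts]
        converged = max(abs(a - b) for a, b in zip(new, weights)) < tol
        weights = new
        if converged:
            break
    return weights

def mixed_logprob(probs, weights):
    """Base-10 log probability of a set under fixed mixture weights."""
    return sum(math.log10(sum(w * pi for w, pi in zip(weights, p)))
               for p in probs)

def fair_mixed_logprob(set1, set2):
    """Score each set with the weights estimated on the other set."""
    w_from_1 = em_mixture_weights(set1)
    w_from_2 = em_mixture_weights(set2)
    return mixed_logprob(set1, w_from_2) + mixed_logprob(set2, w_from_1)
```

Because the weights applied to each half are never tuned on that half, the reported log probability is not optimistically biased by the weight estimation itself.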
Further, the -expt exptID option may be used to specify the
prefix used for all ancillary files created by the program. The
exptID may include a path, and any missing directories in this
path will be created.
Final output will include the ngram outputs for each separate set
and a combined output in the same format for both sets. A logfile
of the procedure is produced in exptID.log.
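
As a rough illustration of what a combined result over both sets amounts to, the sketch below sums the per-set statistics and recomputes perplexity. The PplStats class and its field names are hypothetical, and the perplexity formula is the one ngram is understood to use for its ppl figure (base-10 log probability normalized by words minus OOVs minus zeroprobs plus sentences); treat both as assumptions rather than a description of the script's internals.

```python
from dataclasses import dataclass

@dataclass
class PplStats:
    """Summary statistics for one scored set (hypothetical container)."""
    sentences: int
    words: int
    oovs: int
    zeroprobs: int
    logprob: float  # base-10 log probability of the set

    def ppl(self):
        # Assumed normalization: scored tokens plus end-of-sentence events.
        denom = self.words - self.oovs - self.zeroprobs + self.sentences
        return 10 ** (-self.logprob / denom)

def combine(a, b):
    """Pool two sets: counts and log probabilities simply add."""
    return PplStats(
        sentences=a.sentences + b.sentences,
        words=a.words + b.words,
        oovs=a.oovs + b.oovs,
        zeroprobs=a.zeroprobs + b.zeroprobs,
        logprob=a.logprob + b.logprob,
    )
```

Note that the combined perplexity is recomputed from the pooled totals; it is not the average of the two per-set perplexities.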
Examples:
compute-mixed-logprob -expt 001/mix -text swbd.txt \
    -lm swbd.4bo.gz -lm bn.3bo.gz -lm ch.3bo.gz \
    -lm-flags "-order 4 -classes train400.classes"

compute-mixed-logprob -expt 001/mix -sets swbd-set1.txt swbd-set2.txt \
    -lm swbd.4bo.gz -lm bn.3bo.gz -lm ch.3bo.gz \
    -lm-flags "-order 4 -classes train.400classes"