GT coeffs in -make-big-lm

Andreas Stolcke stolcke at speech.sri.com
Thu May 11 19:55:52 PDT 2006


> Hi!
>
> When I trained a very large model (corpus size approx. 600 million tokens),
> I noticed some behavior that looks a bit odd. Since the LM is going to be
> huge, I'm using the make-big-lm script to build 4 partial LMs in a
> distributed way and then merge them into the final one.
>
> After I started the 4 make-big-lm tasks, the GT coefficients for the first
> one were written to the home directory (and it took some time to realize
> that something was possibly wrong, since this output is not mentioned in
> the manual), and the other running tasks simply used those files, assuming
> the GT pre-computation had been done in advance. This should not seriously
> harm a large model, but it's good to be as precise as possible. So I had to
> delete the GT files manually after each successive (no longer simultaneous)
> make-big-lm run, assuming the n-gram merge would correctly renormalize the
> probabilities. Is that correct, or should I instead compute the GT
> coefficients from the whole .ngram file, save them in the home directory,
> and use them for each partial make-big-lm run?

It is true that make-big-lm saves the statistics needed for count smoothing
in files, so that if you rerun the script they are not recomputed 
(since this step is potentially expensive).  I'm sorry this is not 
well documented.

However, the filenames are keyed to the value of the "-name" option,
so if you want to do several runs in the same directory, just specify 
a distinct -name parameter in each case.
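
For example, the four partial builds could each be given their own -name
prefix, something like the following (the file names and the -order value
here are just placeholders; -read, -order, and -lm are the usual
ngram-count options that make-big-lm accepts):

    make-big-lm -name part1 -read part1.counts.gz -order 3 -lm part1.lm.gz
    make-big-lm -name part2 -read part2.counts.gz -order 3 -lm part2.lm.gz
    (and likewise for parts 3 and 4)

Each run then writes its saved smoothing statistics to files keyed to its
own -name prefix, so simultaneous runs no longer pick up each other's
GT coefficient files.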

--Andreas 



