GT coeffs in -make-big-lm

ilya oparin ioparin at yahoo.co.uk
Thu May 11 08:12:33 PDT 2006


Hi!
   
  While training a very large model (corpus size approx. 600 million tokens), I came across behavior that looks a bit odd. Since the LM is going to be huge, I'm using the make-big-lm script to build 4 partial LMs in a distributed way and then merge them into the final model.
  After I start the 4 make-big-lm tasks, the GT coefficient files for the first one are written to the home directory (and it takes some time to realize that something is possibly wrong, since this output is not mentioned in the manual), and the other running tasks simply pick those files up, assuming the GT pre-computation was already done. This should not seriously hurt a large model, but it's good to be as precise as possible, so I ended up deleting the GT files manually after each consecutive (no longer simultaneous) make-big-lm run, on the assumption that the n-gram merge would correctly renormalize the probabilities. Is that correct, or should I rather compute the GT coefficients once from the whole .ngram file, save them in the home directory, and reuse them for each partial make-big-lm run?
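
  For illustration, here is roughly the kind of setup I mean (all file names are placeholders; the -name prefixes are just my reading of the training-scripts(1) man page as a way to keep each job's auxiliary GT files apart):

    # count N-grams for each corpus chunk (placeholder file names)
    ngram-count -order 3 -text part1.txt -write part1.counts.gz
    ngram-count -order 3 -text part2.txt -write part2.counts.gz

    # build each partial LM with its own auxiliary-file prefix,
    # so the Good-Turing count-of-count files do not collide
    make-big-lm -name part1 -order 3 -read part1.counts.gz -lm part1.lm.gz
    make-big-lm -name part2 -order 3 -read part2.counts.gz -lm part2.lm.gz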


best regards,
Ilya
		