[SRILM User List] FLM Training takes too long!

Andreas Stolcke stolcke at icsi.berkeley.edu
Sun Oct 28 23:16:35 PDT 2012


On 10/28/2012 5:24 PM, Melvin Jose wrote:
>
>
> Hey,
>
> I am presently working with Tamil, a morphologically rich language. I 
> am trying to build an FLM from approximately 3 million entries, but it 
> has now been running for more than a day and a half. The FLM 
> specification is
>
> W : W(-1) W(-2) B(-1) S(-1) using generalized backoff, where B is the 
> word base and S is the suffix.
>
> Below is the output of -debug 2
>
> warning: distributing 0.0989813 left-over probability mass over all 
> 577519 words
> discarded 1 0x4-gram probs predicting pseudo-events
> discarded 1587186 0x4-gram probs discounted to zero
> discarded 1 0x8-gram probs predicting pseudo-events
> discarded 1 0xc-gram probs predicting pseudo-events
> discarded 4721615 0xc-gram probs discounted to zero
> Starting estimation of general graph-backoff node: LM 0 Node 0xC, 
> children: 0x8 0x4
> Finished estimation of multi-child graph-backoff node: LM 0 Node 0xC
>
> This was the last message I received, a day and a half ago. Is it 
> normal for it to take so long? I read that Katrin had no problem 
> training on 5 million entries. Did it take that long? I am using a 
> cluster in my lab to do the computation, so there shouldn't be a 
> problem with memory or computational power.
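
For concreteness, a specification of that shape is handed to fngram-count 
as a factor file. Below is a rough sketch following the format in the FLM 
tutorial distributed with SRILM; the file names, discounting options, and 
the "combine" keyword are placeholders that you should check against the 
documentation for your SRILM version:

    ## one LM specification follows
    1

    ## predict W from the two previous words, previous base, previous suffix
    W : 4 W(-1) W(-2) B(-1) S(-1) tamil.count.gz tamil.lm.gz 6
    W1,W2,B1,S1  W2     kndiscount gtmin 1 interpolate
    W1,B1,S1     W1     kndiscount gtmin 1
    B1,S1        B1,S1  kndiscount gtmin 1 combine mean
    B1           B1     kndiscount gtmin 1
    S1           S1     kndiscount gtmin 1
    0            0      kndiscount gtmin 1

The B1,S1 node here is the generalized-backoff node your debug output 
reports as 0xC (children 0x8 and 0x4), where both remaining parents are 
dropped in parallel and the results combined.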

I have no experience myself to tell you how long it should take. 
However, in cases like this I would run some experiments, increasing the 
amount of data from, say, 10k to 100k entries, to see how the runtime 
grows as a function of input size. Then you can extrapolate to the full 
data set instead of just waiting.
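
As a minimal sketch (assuming your training text is one sentence per 
line in train.txt and the factor file is tamil.flm, both placeholder 
names, and using the fngram-count options shown in the FLM tutorial):

    # Time fngram-count on growing prefixes of the data, then extrapolate.
    # GNU time's -v report also includes "Maximum resident set size",
    # which shows how close each run gets to physical memory.
    for n in 10000 30000 100000 300000; do
        head -n $n train.txt > train.$n.txt
        /usr/bin/time -v fngram-count -factor-file tamil.flm \
            -text train.$n.txt -lm -debug 2 \
            > fngram.$n.log 2> time.$n.log
    done

If the runtime grows much faster than linearly from one step to the 
next, the full 3-million-entry run may genuinely take days, and pruning 
the backoff graph or raising the gtmin cutoffs may help more than 
hardware.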

>
> Is there any way to tell fngram-count to use as much memory as it 
> wants, or to parallelize the computation?
It will take as much memory as it needs, and there is no easy way to 
parallelize it.
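
If you want to check whether the running job is still computing rather 
than swapping, standard tools suffice (assuming a Linux node):

    # Growing CPU TIME with a stable resident set size (RSS) suggests the
    # job is working; RSS near physical RAM suggests it may be thrashing.
    ps -o pid,rss,vsz,time,%cpu -p "$(pgrep fngram-count)"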

Andreas

