[SRILM] Some more FLM questions

ilya oparin ioparin at yahoo.co.uk
Mon Oct 23 04:58:52 PDT 2006


Dear Katrin, thanks for the reply.

I have a couple of other questions for those involved in
FLM development:

1) Is there any way to interpolate FLMs with standard word LMs?
I tried to do this with "ngram" using the "-factored" and
"-mix-lm" options, but it didn't work: the tool expected even
the general (standard) word model to be factored as well, and I
couldn't see how to tell it that the first of the interpolated
models is conventional while the others are factored. As far as
I can tell, "fngram" has no such option either.
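
For now I am experimenting with mixing the two models outside
the toolkit: for the same test text I dump one log10 probability
per token from each model (for "ngram" these are in the
-debug 2 -ppl output; I assume something similar can be pulled
out of "fngram"'s perplexity output) and interpolate them in a
small script, roughly like the sketch below. The file names are
just placeholders of mine, nothing built into SRILM:

import math

def read_logprobs(path):
    # one log10 probability per line, one line per scored token
    with open(path) as f:
        return [float(line) for line in f if line.strip()]

def mixed_perplexity(lp_word, lp_flm, lam=0.5):
    # linear interpolation of the two models' word probabilities
    assert len(lp_word) == len(lp_flm), "models must score the same tokens"
    total = 0.0
    for a, b in zip(lp_word, lp_flm):
        p = lam * 10.0 ** a + (1.0 - lam) * 10.0 ** b
        total += math.log10(p)
    return 10.0 ** (-total / len(lp_word))

print(mixed_perplexity(read_logprobs("word_lm.logprobs"),
                       read_logprobs("flm.logprobs")))

This at least gives me a mixed perplexity, but it does not give
me a single interpolated model I could use for rescoring, so a
built-in option would still be very welcome.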

2) Could you please say more about how you work with large data?
When I was training a model on 5M words, it took 1.2 GB of
memory. I work with inflectional languages (Russian and Czech),
so the factors are really "rich": the features for each word
include its stem, inflection, detailed morphological tag and
lemma. Maybe that is why it takes so much space? Otherwise I
cannot see how you managed to run it on 30M words of English:
in my case, if I want to enlarge the data, it looks like I will
have to switch to a 64-bit architecture. Do SRILM and the FLM
tools support 64-bit somehow?
If it is not only me who is so "lucky" with memory consumption,
what could you suggest to reduce it?
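
As a sanity check I extrapolated my own numbers linearly; this
is only a rough guess, since the number of unique factored
n-grams does not grow exactly linearly with corpus size:

# memory observed for my 5M-word factored corpus, scaled to 50M words
observed_gb    = 1.2
observed_words = 5e6
target_words   = 50e6

bytes_per_token = observed_gb * 1e9 / observed_words    # ~240 bytes per token
projected_gb    = target_words * bytes_per_token / 1e9  # ~12 GB

print(bytes_per_token, projected_gb)

Even if the real growth is slower than linear, that is far
beyond the 3-4 GB a 32-bit process can address, which is why I
am asking about 64-bit support.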

3) Which parameters does the training time depend on?

Thanks in advance,
regards,
ilya

--- Katrin Kirchhoff <katrin at ee.washington.edu> wrote:

> 
> Ilya,
> 
> We have trained FLMs with ~30M words without problems, but
> yes, beyond that it becomes a problem. We are currently
> working on updates to the code that make it possible to use
> larger corpora - these haven't been publicly released yet,
> but I'll let you know when they become available.
> 
> best,
> Katrin
> 
> ilya oparin wrote:
> > Hi, everybody!
> > 
> > Does anyone have any experience of building a Factored
> > Language Model on large data? There is still no problem
> > with, say, processing a file in FLM format containing 5 mln
> > entries, but as soon as I try to feed in a 50 mln FLM
> > corpus, it needs an unfeasible amount of memory (since it
> > loads everything into memory).
> > 
> > Does anyone know if there are any tricks for training an
> > FLM model in this case? Something like building partial LMs
> > and then merging them with standard ngram-count... What
> > could you suggest as a solution?
> > 
> > best regards,
> > Ilya





