[SRILM User List] Question on SRILM Toolkit

Andreas Stolcke stolcke at speech.sri.com
Thu Sep 10 08:25:54 PDT 2009


Saeedeh Momtazi wrote:
> Dear Andreas Stolcke,
>
> I, Saeedeh Momtazi, have been using the SRILM toolkit for a while. The 
> main part of the toolkit that I use is "ngram-class". So far, I have 
> had no problems with it. However, I recently tried to cluster my 
> terms based on a count file of about 6 GB, and I got the following 
> error message:
>
> ngram-class: ../../include/LHash.cc:138: void LHash<KeyT, 
> DataT>::alloc(unsigned int) [with KeyT = unsigned int, DataT = 
> Trie<unsigned int,  long unsigned int>]: Assertion `body != 0' failed.
> /var/torque/mom_priv/jobs/53195.maste.SC: line 39: 25464 Aborted
You are simply running out of memory (the failed assertion is in the 
hash table allocation code).  You need more memory or swap space, and 
you probably need to switch to a 64-bit machine.  However, first make 
sure you are using the memory-optimized version of the tools (compiled 
with make OPTION=_c).

You can always sample your data, or simply prune the count file by 
eliminating low-count ngrams.  This might not change your results much.  
When inducing word classes, words with low counts are not handled 
robustly anyway.  I found it best to replace all words with low counts 
with an "infrequent word" class label ahead of time.  As a by-product, 
this will dramatically reduce the number of distinct bigrams, because 
most of the bigrams involve rare words (Zipf's law etc.).
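
The rare-word replacement could be sketched roughly as follows.  This is 
only an illustration in Python, not part of SRILM: the function names, 
the "<rare>" label and the threshold are made up for the example, and it 
assumes each count-file line has the form "w1 ... wn<TAB>count" and that 
the file includes unigram lines from which word frequencies can be read.

```python
from collections import Counter

RARE_LABEL = "<rare>"                 # hypothetical label, not an SRILM built-in
KEEP = {"<s>", "</s>", "<unk>"}       # tags that are never relabeled


def parse_counts(lines):
    """Yield (ngram_tuple, count) from lines like 'w1 ... wn<TAB>count'."""
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue                  # skip blank or malformed lines
        yield tuple(fields[:-1]), int(fields[-1])


def collapse_rare_words(lines, min_count):
    """Map words whose unigram count is below min_count to RARE_LABEL
    and merge the counts of ngrams that become identical."""
    # Pass 1: word frequencies from the unigram lines.
    unigram = Counter()
    for ngram, count in parse_counts(lines):
        if len(ngram) == 1:
            unigram[ngram[0]] += count

    def relabel(word):
        if word in KEEP or unigram[word] >= min_count:
            return word
        return RARE_LABEL

    # Pass 2: rewrite every ngram and sum counts of colliding entries.
    merged = Counter()
    for ngram, count in parse_counts(lines):
        merged[tuple(relabel(w) for w in ngram)] += count
    return merged


def format_counts(merged):
    """Turn the merged counts back into count-file lines."""
    return ["%s\t%d" % (" ".join(ngram), count)
            for ngram, count in sorted(merged.items())]
```

Here `lines` is a list that is scanned twice; for a multi-gigabyte file 
one would instead read the file twice from disk.  The vocabulary file 
would presumably need the same relabeling, so that the rare words are 
replaced there by the single label.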

Andreas
>
>
> I would appreciate it if you could let me know how I can solve this problem.
> To be more precise, my vocabulary is about 35000 words and I want to 
> cluster them into 3000 classes. The input items that I use when 
> calling ngram-class are the vocab file (-vocab), the count file 
> (-counts) and the number of classes (-numclasses). The only output 
> that I need is a mapping between words and classes (-classes).
>
>
> Looking forward to hearing from you.
>
> Thanks in advance,
> Saeedeh Momtazi


