Question about using SRI with Large Data

Andreas Stolcke stolcke at speech.sri.com
Mon Mar 26 10:38:12 PDT 2007


Ibrahim Zaghloul wrote:
> Dear Eng. Andreas
>
> I am trying to use SRILM with a 5 GB counts file, but every approach
> I have tried has failed. I obtained these counts by using the vocab
> option to limit them. My data was divided into 8 parts, so I generated
> 8 sorted count files and then used ngram-merge to merge them. The
> result was the 5 GB file mentioned above.
> I tried the ordinary command:
> ngram-count -read ngram-file -lm output-lm-file
> but the result was a long error ending with Assertion 'body !=0' failed
> I then tried this command:
> make-big-lm -read ngrams-file -lm lm-file
> but got the same error.
> I also tried the -gtNmin options, but again received the same error.
Please check $SRILM/doc/FAQ for a list of measures to try.  If none of
them work, then you simply have too much data and too little memory,
and need to get a larger machine.  Note that you should ALWAYS succeed
by raising the minimum counts (the -gtNmin options) sufficiently.  The
exact values will depend on your data and the amount of memory you have.
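
To illustrate what raising the minimum counts does to the size of the
data, here is a minimal Python sketch that drops N-grams whose count
falls below a per-order threshold while streaming through a counts file.
It is not part of SRILM: the function name filter_counts and the
threshold values are hypothetical, and the sketch assumes one N-gram per
line with the count as the last whitespace-separated field.  Note that
pre-filtering like this also changes the count-of-counts statistics used
for discounting, so it is not equivalent to passing -gtNmin to the SRILM
tools; it only shows why higher thresholds shrink the problem.

```python
from typing import Dict, Iterable, Iterator


def filter_counts(lines: Iterable[str], min_counts: Dict[int, int]) -> Iterator[str]:
    """Yield only the count lines whose count meets the minimum for their order.

    min_counts maps N-gram order to a minimum count (hypothetical values);
    orders that are not listed default to a minimum of 1, i.e. nothing is dropped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        order = len(fields) - 1   # number of words in the N-gram
        count = int(fields[-1])   # assumed: the count is the last field
        if count >= min_counts.get(order, 1):
            yield line.rstrip("\n")


# Small in-memory demonstration (made-up counts, not real data).
sample = [
    "the\t10",
    "the cat\t3",
    "the cat sat\t1",
    "the cat ran\t2",
]
for kept in filter_counts(sample, {2: 2, 3: 2}):
    print(kept)
```

Because the function is a generator, it never holds more than one line
in memory, so the same loop could be run over a file object opened on a
large counts file.  The thresholds would then be raised until the
surviving N-grams fit in the available memory.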
>
> When I tried to use make-google-ngrams, the result was the error:
> "/sri/bin/make-google-ngrams gzip=0 cna.ngrams
> sort: invalid option -- 2"
>
make-google-ngrams is not the right tool for this problem.

Andreas




