ngram-count -read performance difference for tokens that start with different characters

Wed Apr 9 06:57:01 PDT 2008

Dear SRILM List Members,

I am using and extending SRILM for our own language modeling purposes. One
design decision I have made is to build separate language models for
different types of tokens. In my corpus, one type of token starts with a '+'
character, whereas the other does not. Although the counts for the two types
are exactly the same, and their respective count files and the language
models generated from them have similar sizes, I am observing a significant
difference in how long the ngram-count command takes to run on each.

For instance, for the tokens that do not start with a '+', ngram-count
builds a language model from the training data count file (using the  -read
option) in 6 seconds, whereas for the tokens that do start with a '+' it
takes 42 seconds. Thus there seems to be a 6-7 times difference in
ngram-count running time between count files generated for tokens that start
with a '+' and count files generated for tokens that do not.
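
In case it helps anyone reproduce this, here is a minimal sketch of how the
two runs could be timed side by side. The file names (plain.counts,
plus.counts and the matching .lm outputs) are hypothetical placeholders, the
time_command helper is my own illustration and not part of SRILM, and
ngram-count is assumed to be on the PATH. Only the -read and -lm options are
used.

```python
import subprocess
import time


def time_command(cmd, repeats=3):
    """Return the fastest wall-clock time in seconds over `repeats` runs of cmd."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the command exits non-zero.
        subprocess.run(cmd, check=True, stdout=subprocess.DEVNULL)
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    # Hypothetical file names; substitute the real count files.
    runs = {
        "plain": ["ngram-count", "-read", "plain.counts", "-lm", "plain.lm"],
        "plus": ["ngram-count", "-read", "plus.counts", "-lm", "plus.lm"],
    }
    timings = {name: time_command(cmd) for name, cmd in runs.items()}
    for name, secs in timings.items():
        print(f"{name}: {secs:.2f} s")
    print(f"ratio plus/plain: {timings['plus'] / timings['plain']:.1f}x")
```

Taking the fastest of several runs is meant to reduce the effect of disk
caching and other machine load on the comparison.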

I am curious whether something internal to SRILM prevents the model-building
procedure for tokens that start with a '+' character from running as fast as
it does for tokens of other types. What might be causing this performance
difference?

Thanks,
Ergun

-- 
Ergun Bicici
Koc University