ngram-count -read performance difference for tokens that start with different characters

Andreas Stolcke stolcke at speech.sri.com
Sat Apr 12 00:03:54 PDT 2008


> 
> Dear SRILM List Members,
> 
> I am using / augmenting SRILM for our own language modeling purposes. One
> decision that I make is to separate language models for different types of
> tokens. In my corpus, one type of token starts with a '+' character, whereas
> another does not. Although their counts are exactly the same, and their
> respective count files and the language models generated from them have
> similar sizes, I am observing significant differences in performance when
> running the ngram-count command.
> 
> For instance, for the tokens that do not start with a '+', ngram-count (with
> the -read option) may finish building a language model from a training-data
> count file in 6 seconds, whereas for the other type it takes 42 seconds. Thus
> there seems to be a 6-7x difference in ngram-count performance between count
> files generated for tokens that start with a '+' and those that do not.
> 
> I am curious whether there is some internal decision that prevents model
> building for tokens starting with a '+' character from performing as fast as
> it does for other token types. What might be causing this performance
> difference?
> 
> Thanks,
> Ergun
> 

Ergun,

your problem has nothing to do with the characters in your words.
The problem is in the counts themselves: your two count files contain
different count values, and that is all that matters.

The counts file containing the '+' characters has a peculiar distribution of
unigram counts (after applying the KN discounting).  In interpolated
discounting, the uniform distribution is added to the unigram estimates;
for some reason, in this case this makes the probabilities sum to something > 1.
That triggers a "counter-measure" which successively increments the denominator
in the estimator, and in this case the step has to be repeated many, many times
to yield a proper unigram probability distribution.  Hence the long run time.
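
To see why that loop can dominate the run time, here is a toy sketch in
Python (NOT SRILM's actual C++ code; all numbers are made up) of the kind of
counter-measure described above:

    # Toy illustration of the "counter-measure": if the smoothed unigram
    # estimates sum to more than 1, bump the denominator one count at a
    # time until the result is a proper probability distribution.
    # (All values are invented for illustration.)

    numerators = [4.2, 3.7, 2.9, 1.8]    # discounted counts + uniform mass
    denominator = sum(numerators) - 2.0  # pretend the estimate overshoots

    iterations = 0
    while sum(n / denominator for n in numerators) > 1.0:
        denominator += 1                 # one small increment per pass
        iterations += 1

    print(iterations, "increments needed")

Since each +1 step changes the total probability only by roughly
sum/denominator^2, the number of iterations grows with both the size of
the overshoot and the magnitude of the counts, which is consistent with
the large run-time difference you observed.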

There are two ways to fix this.  One is to simply avoid interpolated KN
discounting for the unigrams: instead of -interpolate, use

		-interpolate2 -interpolate3

The other is to download the updated SRILM beta release, which fixes this
problem automatically.
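
Concretely, the first workaround would change an invocation along these
lines (the order, discounting flag, and file names here are hypothetical;
only the interpolation options matter):

	ngram-count -order 3 -read plus.counts -kndiscount \
		-interpolate -lm plus.lm

into

	ngram-count -order 3 -read plus.counts -kndiscount \
		-interpolate2 -interpolate3 -lm plus.lm

That way interpolation is still applied to the bigram and trigram estimates,
but the unigram estimates are left uninterpolated.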

Andreas 



