[SRILM User List] Usage of make-big-lm and -interpolate option
Andreas Stolcke
stolcke at icsi.berkeley.edu
Fri Jun 27 14:06:37 PDT 2014
On 6/27/2014 7:46 AM, Stefan Fischer wrote:
> Thanks for your reply!
>
> There is one thing I don't understand:
> The training.txt file contains 857661 words and there are 4848 OOVs
> that all occur only once.
>
> So, OOVs make up a fraction of 0.00565 (about 0.57%) of training.txt.
> If I use ngram-count directly, p(<unk>) is 0.00600, which is close to
> the actual fraction.
> If I use ngram-count + make-big-lm, p(<unk>) is 0.03206, which is more
> than 5 times higher than the actual fraction.
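The arithmetic behind the quoted numbers can be checked directly (a quick sanity check; the token and OOV counts are taken from the post, nothing else is assumed):

```python
# Figures quoted in the email: 857661 training tokens, 4848 OOV tokens,
# each OOV occurring exactly once.
tokens = 857661
oovs = 4848

# The empirical OOV rate, as a fraction of the training tokens.
oov_fraction = oovs / tokens
print(f"OOV fraction: {oov_fraction:.5f}")          # ~0.00565

# p(<unk>) as reported for the two estimation paths.
p_unk_direct = 0.00600   # ngram-count alone: close to the empirical rate
p_unk_biglm = 0.03206    # via make-big-lm: several times larger

print(f"make-big-lm / empirical ratio: {p_unk_biglm / oov_fraction:.2f}")
```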
The main difference between ngram-count and make-big-lm, for your
purposes, is that the latter computes the discounting parameters from
counts over the entire vocabulary.
ngram-count limits the vocabulary (according to the -vocab option,
which I'm assuming you're using) as it reads the counts, and only then
estimates the discounting parameters.
This difference affects how much probability mass is held back from the
unigram estimates and redistributed over all words.
There should be a message about that in the output. You can compare
them to see if that explains the difference.
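For reference, the two estimation paths being compared might look roughly like this (a sketch only; the file names, n-gram order, and the choice of Kneser-Ney discounting are illustrative assumptions, not taken from the original post):

```shell
# Path 1: ngram-count alone. The vocabulary is restricted (via -vocab)
# while the training text is read, so the discounting parameters are
# estimated from the already-restricted counts.
ngram-count -order 3 -kndiscount -interpolate \
    -vocab vocab.txt -unk \
    -text training.txt -lm direct.lm

# Path 2: ngram-count + make-big-lm. Raw counts over the full
# vocabulary are written first; make-big-lm then computes the
# discounting parameters from those counts before the vocabulary
# restriction takes effect.
ngram-count -order 3 -text training.txt -write counts.gz
make-big-lm -order 3 -kndiscount -interpolate \
    -vocab vocab.txt -unk \
    -read counts.gz -name biglm -lm big.lm
```

Since the two paths see different count totals when fitting the discounts, they hold back different amounts of probability mass for unseen events, which is what the log messages mentioned above let you compare.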
Andreas