[SRILM User List] Usage of make-big-lm and -interpolate option
Andreas Stolcke
stolcke at icsi.berkeley.edu
Fri Jun 27 14:06:37 PDT 2014
On 6/27/2014 7:46 AM, Stefan Fischer wrote:
> Thanks for your reply!
>
> There is one thing I don't understand:
> The training.txt file contains 857661 words and there are 4848 OOVs
> that all occur only once.
>
> So, OOVs make up a fraction of 0.00565 (about 0.57%) of training.txt.
> If I use ngram-count directly, p(<unk>) is 0.00600, which is close to
> the actual fraction.
> If I use ngram-count + make-big-lm, p(<unk>) is 0.03206, which is more
> than 5 times higher than the actual fraction.
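The arithmetic behind the quoted numbers can be checked directly (a quick sanity check; the token and OOV counts are taken from the post, nothing else is assumed):

```python
# Figures quoted in the email: 857661 training tokens, 4848 OOV tokens,
# each OOV occurring exactly once.
tokens = 857661
oovs = 4848

# The empirical OOV rate, as a fraction of the training tokens.
oov_fraction = oovs / tokens
print(f"OOV fraction: {oov_fraction:.5f}")          # ~0.00565

# p(<unk>) as reported for the two estimation paths.
p_unk_direct = 0.00600   # ngram-count alone: close to the empirical rate
p_unk_biglm = 0.03206    # via make-big-lm: several times larger

print(f"make-big-lm / empirical ratio: {p_unk_biglm / oov_fraction:.2f}")
```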
The main difference between ngram-count and make-big-lm, for your
purposes, is that the latter computes the discounting parameters from
counts over the entire vocabulary.
ngram-count limits the vocabulary (according to the -vocab option,
which I'm assuming you're using) as it reads the counts, and only then
estimates the discounting parameters.
This difference affects how much probability mass is held back from the
unigram estimates and redistributed over all words.
There should be a message about that in the output. You can compare
them to see if that explains the difference.
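For reference, the two estimation paths being compared might look roughly like this (a sketch only; the file names, n-gram order, and the choice of Kneser-Ney discounting are illustrative assumptions, not taken from the original post):

```shell
# Path 1: ngram-count alone. The vocabulary is restricted (via -vocab)
# while the training text is read, so the discounting parameters are
# estimated from the already-restricted counts.
ngram-count -order 3 -kndiscount -interpolate \
    -vocab vocab.txt -unk \
    -text training.txt -lm direct.lm

# Path 2: ngram-count + make-big-lm. Raw counts over the full
# vocabulary are written first; make-big-lm then computes the
# discounting parameters from those counts before the vocabulary
# restriction takes effect.
ngram-count -order 3 -text training.txt -write counts.gz
make-big-lm -order 3 -kndiscount -interpolate \
    -vocab vocab.txt -unk \
    -read counts.gz -name biglm -lm big.lm
```

Since the two paths see different count totals when fitting the discounts, they hold back different amounts of probability mass for unseen events, which is what the log messages mentioned above let you compare.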
Andreas