[SRILM User List] Usage of make-big-lm and -interpolate option
Stefan Fischer
sfischer at ymail.com
Fri Jun 27 07:46:41 PDT 2014
Thanks for your reply!
There is one thing I don't understand:
The training.txt file contains 857661 words and there are 4848 OOVs
that all occur only once.
So, OOVs make up 0.565% of the tokens in training.txt, i.e. a relative
frequency of 0.00565.
If I use ngram-count directly, p(<unk>) is 0.00600, which is close to
that relative frequency.
If I use ngram-count + make-big-lm, p(<unk>) is 0.03206, which is more
than 5 times higher.
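(For reference: 4848 / 857661 = 0.00565. The p(<unk>) values above can be
read off the \1-grams: section of the ARPA file, where probabilities are
stored as base-10 logs; for example a unigram line of roughly

  -2.2218 <unk>

corresponds to 10^-2.2218 = 0.0060. Something like

  grep "<unk>" lm.arpa | head -1

should show that line, assuming lm.arpa is the output file from the
commands quoted below.)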
Do you have any explanation for that? It seems counter-intuitive ...
Is my corpus large enough for option -kndiscount?
Regards,
Stefan
2014-06-16 4:07 GMT+02:00 Andreas Stolcke <stolcke at icsi.berkeley.edu>:
> On 06/13/2014 12:16 PM, Stefan Fischer wrote:
>>
>> Hello,
>>
>> I read that using make-big-lm is preferable to using ngram-count directly.
>> Even though my corpus is not very big, how do I switch from
>> ngram-count to make-big-lm?
>>
>> This is what I'm using so far:
>> ngram-count -order 3 -kndiscount -interpolate -unk -text
>> training.txt -vocab at_least_twice.txt -lm lm.arpa
>>
>> Is this the right way to use make-big-lm?
>> Do I have to pass more options to ngram-count if I am only interested
>> in counts?
>> ngram-count -write counts.gz -text training.txt
>> make-big-lm -read counts.gz -order 3 -kndiscount -interpolate -unk
>> -text training.txt -vocab at_least_twice.txt -lm lm.arpa
>
> You did it right.
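> (To be explicit about the counting step: ngram-count defaults to -order 3,
> so
>
>     ngram-count -order 3 -write counts.gz -text training.txt
>
> produces the same counts file as your command without -order; spelling it
> out just makes the two steps visibly consistent.)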
>
>
>>
>> My second question is w.r.t. the -interpolate option.
>> I get the following warning several times:
>> warning: 2.01524e-06 backoff probability mass left for ". dunno" --
>> disabling interpolation
>> Is this just for my information or is it a sign of using bad parameters?
>
> It's just for information. Sometimes there is no backoff probability mass
> left for lower-order ngram estimates, and it doesn't make sense to apply
> interpolation in that case, so the code falls back on standard KN smoothing
> (without interpolation).
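> (Roughly speaking, the interpolated estimate is
>
>     p(w | h) = p_discounted(w | h) + bow(h) * p(w | h')
>
> where bow(h) is the leftover backoff mass for the context h and h' is the
> shortened history. When that mass is as tiny as 2.01524e-06, as in the
> warning above, the lower-order term contributes essentially nothing, so
> the estimates for that context are computed in plain backoff form instead.)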
>
> Andreas
>