[SRILM User List] Unk vs open vocabulary
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Sep 22 22:37:01 PDT 2016
On 9/21/2016 8:21 AM, Dávid Nemeskey wrote:
> Hi guys,
>
> I was wondering how <unk>, open vocabulary, and discounting
> interact in SRILM. Up until now, I have been using kndiscount models,
> but I realized that when the size of the vocabulary is limited (e.g.
> 10k words), the singleton count-of-counts might become 0, and so KN
> (as well as GT) cannot be used. I know there are other methods, but it
> made me think.
That is a known issue, and the recommended solution is to estimate the
discounting factors BEFORE truncating the vocabulary.
That is exactly what the 'make-big-lm' wrapper script does (described in
the training-scripts(1)
<http://www.speech.sri.com/projects/srilm/manpages/training-scripts.1.html>
man page).
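For concreteness, here is a minimal sketch (with hypothetical count-of-count
values, not taken from any real data) of why a truncated vocabulary can break
KN estimation: the modified Kneser-Ney discounts are computed from the numbers
of n-grams seen exactly 1, 2, 3, and 4 times, and once all OOVs are folded
into <unk>, there may be no singleton n-grams left. Estimating the discounts
from the untruncated counts, as make-big-lm does, sidesteps this.

    def kn_discounts(n1, n2, n3, n4):
        """Modified Kneser-Ney discount constants (Chen & Goodman) from
        count-of-counts n1..n4; a hypothetical helper, not SRILM code."""
        if n1 == 0 or n2 == 0 or n3 == 0:
            raise ValueError("count-of-counts too sparse to estimate KN discounts")
        y = n1 / (n1 + 2 * n2)
        return (1 - 2 * y * n2 / n1,   # D1: discount for n-grams seen once
                2 - 3 * y * n3 / n2,   # D2: discount for n-grams seen twice
                3 - 4 * y * n4 / n3)   # D3+: discount for n-grams seen 3+ times

    # Full vocabulary: plenty of singleton n-grams, discounts are well defined.
    print(kn_discounts(n1=120_000, n2=45_000, n3=25_000, n4=17_000))

    # 10k vocabulary with OOVs collapsed into <unk>: singletons can vanish
    # and the discounts can no longer be estimated.
    try:
        kn_discounts(n1=0, n2=3_000, n3=8_000, n4=12_000)
    except ValueError as e:
        print("truncated-vocabulary counts:", e)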
>
> What do we gain by discounting, if OOVs are mapped to <unk> anyway and
> <unk> is part of the vocabulary (as far as I understand, this is what
> -unk does)? If we apply discounting, wouldn't it just give an even
> bigger probability to <unk>, as it would also get weight from all the
> other words (including itself)? Shouldn't we then just use an ML
> estimate if <unk> is part of the vocabulary?
No, because you may still have individual words in your vocabulary that
occur only once or twice in the training data, and their ML estimates
would be too high without discounting.
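As a toy illustration (all numbers made up, with a uniform distribution
standing in for a real backoff distribution): a word seen once in 1,000
training tokens gets probability 0.001 under maximum likelihood, which is
typically too high; absolute discounting shaves mass off such rare words and
redistributes it over the whole (closed, <unk>-containing) vocabulary.

    # Toy unigram counts over a closed 10k-word vocabulary (incl. <unk>);
    # only the seen words are listed, and the numbers are invented.
    counts = {"the": 500, "of": 300, "soliloquy": 1, "<unk>": 199}
    total = sum(counts.values())      # 1,000 training tokens
    vocab_size = 10_000               # limited vocabulary, <unk> included
    D = 0.5                           # assumed absolute-discount constant

    p_ml = {w: c / total for w, c in counts.items()}

    # Subtract D from every seen word's count and spread the removed mass
    # uniformly over the whole vocabulary (stand-in for a backoff model).
    leftover = D * len(counts) / total
    p_disc = {w: (c - D) / total + leftover / vocab_size for w, c in counts.items()}

    print(p_ml["soliloquy"])    # 0.001   -- ML overestimates a singleton
    print(p_disc["soliloquy"])  # ~0.0005 -- discounting pulls it down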
Andreas