[SRILM User List] Unk vs open vocabulary
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Sep 22 22:37:01 PDT 2016
On 9/21/2016 8:21 AM, Dávid Nemeskey wrote:
> Hi guys,
>
> I was wondering how <unk>, open vocabulary, and discounting
> interact in SRILM. Up until now, I have been using kndiscount models,
> but I realized that when the size of the vocabulary is limited (e.g.
> 10k words), the singleton count-of-counts might become 0, and so KN
> (as well as GT) cannot be used. I know there are other methods, but it
> made me think.
That is a known issue, and the recommended solution is to estimate the
discounting factors BEFORE truncating the vocabulary.
That is exactly what the 'make-big-lm' wrapper script does (described in
the training-scripts(1)
<http://www.speech.sri.com/projects/srilm/manpages/training-scripts.1.html>
man page).
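For concreteness, here is a minimal sketch (with hypothetical count-of-count
values, not taken from any real data) of why a truncated vocabulary can break
KN estimation: the modified Kneser-Ney discounts are computed from the numbers
of n-grams seen exactly 1, 2, 3, and 4 times, and once all OOVs are folded
into <unk>, there may be no singleton n-grams left. Estimating the discounts
from the untruncated counts, as make-big-lm does, sidesteps this.

    def kn_discounts(n1, n2, n3, n4):
        """Modified Kneser-Ney discount constants (Chen & Goodman) from
        count-of-counts n1..n4; a hypothetical helper, not SRILM code."""
        if n1 == 0 or n2 == 0 or n3 == 0:
            raise ValueError("count-of-counts too sparse to estimate KN discounts")
        y = n1 / (n1 + 2 * n2)
        return (1 - 2 * y * n2 / n1,   # D1: discount for n-grams seen once
                2 - 3 * y * n3 / n2,   # D2: discount for n-grams seen twice
                3 - 4 * y * n4 / n3)   # D3+: discount for n-grams seen 3+ times

    # Full vocabulary: plenty of singleton n-grams, discounts are well defined.
    print(kn_discounts(n1=120_000, n2=45_000, n3=25_000, n4=17_000))

    # 10k vocabulary with OOVs collapsed into <unk>: singletons can vanish
    # and the discounts can no longer be estimated.
    try:
        kn_discounts(n1=0, n2=3_000, n3=8_000, n4=12_000)
    except ValueError as e:
        print("truncated-vocabulary counts:", e)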
>
> What do we gain by discounting, if OOVs are mapped to <unk> anyway and
> <unk> is part of the vocabulary (as far as I understand, this is what
> -unk does)? If we apply discounting, wouldn't it just give an even
> bigger probability to <unk>, as it would also get weight from all the
> other words (including itself)? Shouldn't we then just use an ML
> estimate if <unk> is part of the vocabulary?
No, because you may still have individual words in your vocabulary that
occur only once or twice in the training data, and their ML estimates
would be too high without discounting.
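As a toy illustration (all numbers made up, with a uniform distribution
standing in for a real backoff distribution): a word seen once in 1,000
training tokens gets probability 0.001 under maximum likelihood, which is
typically too high; absolute discounting shaves mass off such rare words and
redistributes it over the whole (closed, <unk>-containing) vocabulary.

    # Toy unigram counts over a closed 10k-word vocabulary (incl. <unk>);
    # only the seen words are listed, and the numbers are invented.
    counts = {"the": 500, "of": 300, "soliloquy": 1, "<unk>": 199}
    total = sum(counts.values())      # 1,000 training tokens
    vocab_size = 10_000               # limited vocabulary, <unk> included
    D = 0.5                           # assumed absolute-discount constant

    p_ml = {w: c / total for w, c in counts.items()}

    # Subtract D from every seen word's count and spread the removed mass
    # uniformly over the whole vocabulary (stand-in for a backoff model).
    leftover = D * len(counts) / total
    p_disc = {w: (c - D) / total + leftover / vocab_size for w, c in counts.items()}

    print(p_ml["soliloquy"])    # 0.001   -- ML overestimates a singleton
    print(p_disc["soliloquy"])  # ~0.0005 -- discounting pulls it down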
Andreas