[SRILM User List] Unk vs open vocabulary

Wed Sep 21 08:21:19 PDT 2016

Hi guys,

I was wondering how <unk>, open vocabulary, and discounting
interact in SRILM. Up till now, I have been using kndiscount models,
but I realized that when the size of the vocabulary is limited (e.g.
10k words), the singleton count-of-counts might become 0, and so KN
(as well as GT) cannot be used. I know there are other methods, but it
made me think.
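
To make the count-of-counts problem concrete, here is a minimal sketch in
Python. It uses the textbook Kneser-Ney estimate D = n1 / (n1 + 2*n2), which
is an assumption on my part and not necessarily what SRILM computes
internally. The function names and the toy counts are made up for
illustration.

```python
from collections import Counter

def count_of_counts(ngram_counts):
    """Map each count value r to n_r, the number of n-grams seen exactly r times."""
    return Counter(ngram_counts.values())

def kn_discount(ngram_counts):
    """Textbook estimate D = n1 / (n1 + 2*n2) (assumed formula, not SRILM code).

    Returns None when n1 or n2 is zero, because D is then 0, 1, or undefined.
    """
    n = count_of_counts(ngram_counts)
    n1, n2 = n[1], n[2]
    if n1 == 0 or n2 == 0:
        return None
    return n1 / (n1 + 2 * n2)

# Toy unigram counts with the full vocabulary: two singletons, two doubletons.
full = {"the": 5, "cat": 2, "sat": 2, "mat": 1, "dog": 1}

# Same data after limiting the vocabulary: the two singletons are merged
# into <unk>, so no type is left with a count of 1.
limited = {"the": 5, "cat": 2, "sat": 2, "<unk>": 2}

d_full = kn_discount(full)
d_limited = kn_discount(limited)
```

With the full vocabulary the discount is well defined, but once the rare
words are folded into <unk> the singleton count-of-counts is 0 and the
estimate is no longer usable.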

What do we gain by discounting, if OOVs are mapped to <unk> anyway and
<unk> is part of the vocabulary (as far as I understand, this is what
-unk does)? If we apply discounting, wouldn't it just give an even
bigger probability to <unk>, as it would also get weight from all the
other words (including itself)? Shouldn't we then just use an ML
estimate if <unk> is part of the vocabulary?
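
As a sketch of the ML case I have in mind (plain Python, not SRILM; the
helper names, the toy sentence, and the vocabulary are made up, and the
mapping only mirrors my understanding of -unk described above):

```python
from collections import Counter

def map_oov(tokens, vocab, unk="<unk>"):
    """Replace every token outside the vocabulary with the unknown-word symbol."""
    return [t if t in vocab else unk for t in tokens]

def ml_unigram(tokens):
    """Maximum-likelihood unigram estimate: relative frequency of each token."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

tokens = "the cat sat on the mat".split()
vocab = {"the", "cat", "sat"}  # hypothetical limited vocabulary

mapped = map_oov(tokens, vocab)
probs = ml_unigram(mapped)
```

Here <unk> already receives the relative frequency of all OOV tokens, and
the estimates sum to 1 without any discounting step.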

Thanks,
Dávid