<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 9/21/2016 8:21 AM, Dávid Nemeskey
wrote:<br>
</div>
<blockquote
cite="mid:CAHOrvWfbVYsAe2COWoHvovgvay75+YCAos0VmSetcOa61FgnHg@mail.gmail.com"
type="cite">
<pre wrap="">Hi guys,
I was wondering about how <unk>, open vocabulary and discounting
interacts in SRILM. Up till now, I have been using kndiscount models,
but I realized that when the size of the vocabulary is limited (e.g.
10k words), the singleton count-of-counts might become 0, and so KN
(as well as GT) cannot be used. I know there are other methods, but it
made me think.</pre>
</blockquote>
    That is a known issue, and the recommended solution is to estimate
    the discounting factors BEFORE truncating the vocabulary.<br>
    That is exactly what the 'make-big-lm' wrapper script does (described
    in the <a
      href="http://www.speech.sri.com/projects/srilm/manpages/training-scripts.1.html">training-scripts(1)</a>
    man page).<br>
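    <br>
    As a concrete sketch (the file names, the 10k vocabulary file, and
    the trigram order below are just placeholders; adjust the
    ngram-count options to your setup), the idea is to collect counts
    over the full, untruncated vocabulary and let make-big-lm estimate
    the discounts from those counts before the vocabulary restriction
    is applied:<br>
    <pre>
# Count n-grams over the full (untruncated) vocabulary
ngram-count -order 3 -text train.txt -write counts.gz

# Estimate the KN discounts from the full counts, then build the LM
# restricted to the 10k vocabulary, mapping OOV words to &lt;unk&gt;
make-big-lm -name biglm -read counts.gz \
        -order 3 -kndiscount -interpolate \
        -vocab vocab.10k -unk -lm lm.10k.gz
    </pre>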
<br>
<blockquote
cite="mid:CAHOrvWfbVYsAe2COWoHvovgvay75+YCAos0VmSetcOa61FgnHg@mail.gmail.com"
type="cite">
<pre wrap="">
What do we gain by discounting, if OOVs are mapped to <unk> anyway and
<unk> is part of the vocabulary (as far as I understand, this is what
-unk does)? If we apply discounting, wouldn't it just give an even
bigger probability to <unk>, as would also get weight from all the
other words (including itself)? Shouldn't then we just use an ML
estimate if <unk> is part of the vocabulary?</pre>
</blockquote>
    No, because you may still have individual words in your vocabulary
    that occur only once or twice in the training data, and their ML
    estimates would be too high without discounting.<br>
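    <br>
    As a purely illustrative example (the discount D = 0.75 is just a
    typical magnitude, not what SRILM would actually estimate from your
    data), consider a bigram seen once in a history seen 10 times:<br>
    <pre>
p_ML(w|h)            = c(h,w)/c(h)      = 1/10           = 0.10
discounted main term = (c(h,w)-D)/c(h)  = (1 - 0.75)/10  = 0.025
    </pre>
    The mass removed from such singletons goes to the backoff
    distribution (which includes &lt;unk&gt;) instead of being credited
    entirely to words seen once.<br>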
<br>
Andreas<br>
<br>
</body>
</html>