<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 9/21/2016 8:21 AM, Dávid Nemeskey
wrote:<br>
</div>
<blockquote
cite="mid:CAHOrvWfbVYsAe2COWoHvovgvay75+YCAos0VmSetcOa61FgnHg@mail.gmail.com"
type="cite">
<pre wrap="">Hi guys,
I was wondering about how <unk>, open vocabulary and discounting
interacts in SRILM. Up till now, I have been using kndiscount models,
but I realized that when the size of the vocabulary is limited (e.g.
10k words), the singleton count-of-counts might become 0, and so KN
(as well as GT) cannot be used. I know there are other methods, but it
made me think.</pre>
</blockquote>
    That is a known issue, and the recommended solution is to estimate
    the discounting factors BEFORE truncating the vocabulary.<br>
    That is exactly what the 'make-big-lm' wrapper script does (described
    in the <a
      href="http://www.speech.sri.com/projects/srilm/manpages/training-scripts.1.html">training-scripts(1)</a>
    man page).<br>
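    <br>
    As a concrete sketch (the file names, the 10k vocabulary file, and
    the trigram order below are just placeholders; adjust the
    ngram-count options to your setup), the idea is to collect counts
    over the full, untruncated vocabulary and let make-big-lm estimate
    the discounts from those counts before the vocabulary restriction
    is applied:<br>
    <pre>
# Count n-grams over the full (untruncated) vocabulary
ngram-count -order 3 -text train.txt -write counts.gz

# Estimate the KN discounts from the full counts, then build the LM
# restricted to the 10k vocabulary, mapping OOV words to &lt;unk&gt;
make-big-lm -name biglm -read counts.gz \
        -order 3 -kndiscount -interpolate \
        -vocab vocab.10k -unk -lm lm.10k.gz
    </pre>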
<br>
<blockquote
cite="mid:CAHOrvWfbVYsAe2COWoHvovgvay75+YCAos0VmSetcOa61FgnHg@mail.gmail.com"
type="cite">
<pre wrap="">
What do we gain by discounting, if OOVs are mapped to <unk> anyway and
<unk> is part of the vocabulary (as far as I understand, this is what
-unk does)? If we apply discounting, wouldn't it just give an even
bigger probability to <unk>, as would also get weight from all the
other words (including itself)? Shouldn't then we just use an ML
estimate if <unk> is part of the vocabulary?</pre>
</blockquote>
    No, because you may still have individual words in your vocabulary
    that occur only once or twice in the training data, and their ML
    estimates would be too high without discounting.<br>
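    <br>
    As a purely illustrative example (the discount D = 0.75 is just a
    typical magnitude, not what SRILM would actually estimate from your
    data), consider a bigram seen once in a history seen 10 times:<br>
    <pre>
p_ML(w|h)            = c(h,w)/c(h)      = 1/10           = 0.10
discounted main term = (c(h,w)-D)/c(h)  = (1 - 0.75)/10  = 0.025
    </pre>
    The mass removed from such singletons goes to the backoff
    distribution (which includes &lt;unk&gt;) instead of being credited
    entirely to words seen once.<br>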
<br>
Andreas<br>
<br>
</body>
</html>