<html>

  <head>

    <meta content="text/html; charset=utf-8" http-equiv="Content-Type">

  </head>

  <body bgcolor="#FFFFFF" text="#000000">

    <div class="moz-cite-prefix">On 7/5/2016 8:04 AM, Bey Youcef wrote:<br>

    </div>

    <blockquote

cite="mid:CACJWz0T1OKP2L5AnfPGNxnJ8_6-asD9KUu41TumVV1GRnoP+aQ@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div>Dear all,</div>

        <div><br>

        </div>

        <div>I'm new to this community and strongly interested in SMT
          and NLP.</div>

        <div><br>

        </div>

        <div>Here is an example:</div>

        <div><a moz-do-not-send="true"

href="http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats">http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats</a></div>

        <div>


          <pre>\data\
ngram 1=7
ngram 2=7

\1-grams:
0.1 &lt;UNK&gt; 0.5555
0 &lt;s&gt;      0.4939
0.1 &lt;/s&gt;   1.0
0.2 wood         0.5555
0.2 cindy       0.5555
0.2 pittsburgh          0.5555
0.2 jean         0.6349

\2-grams:
0.5555 &lt;UNK&gt; wood
0.5555 &lt;s&gt; &lt;UNK&gt;
0.5555 wood pittsburgh
0.5555 cindy jean
0.5555 pittsburgh cindy
0.2778 jean &lt;/s&gt;
0.2778 jean wood

\end\</pre>

        </div>

        <div><br>

        </div>

        <div>Question:</div>

        <div><br>

        </div>

        <div>1. Why does "UNK" exist in the ARPA file after training?</div>

        <div><br>

        </div>

        <div>As far as I know, every word in the model occurs at least
          once in the training corpus. Hence, after training, the ARPA
          file shouldn't contain UNK (unknown words).</div>

      </div>

    </blockquote>

    People often include an &lt;UNK&gt; token as a placeholder for words
    that may appear in new data but were not present in the training
    data.<br>
    To build such a model, you fix a model vocabulary that does not
    contain every word in the training corpus (typically you exclude
    low-frequency words). Training words outside that vocabulary are
    replaced by &lt;UNK&gt;, and its probability is then estimated in
    the usual way, like that of any other word.<br>
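    <br>
    As a rough illustration of that vocabulary step, here is a minimal
    Python sketch. The helper names (build_vocab, map_to_unk), the
    min_count threshold of 2, and the three-sentence toy corpus are all
    assumptions made for this example; real toolkits expose their own
    options for fixing the vocabulary.<br>

```python
from collections import Counter

UNK = "<UNK>"  # placeholder token for out-of-vocabulary words


def build_vocab(sentences, min_count=2):
    """Keep only words seen at least min_count times (hypothetical cutoff)."""
    counts = Counter(word for sent in sentences for word in sent)
    return {word for word, count in counts.items() if count >= min_count}


def map_to_unk(sentences, vocab):
    """Replace every word outside the vocabulary with the UNK token."""
    return [[w if w in vocab else UNK for w in sent] for sent in sentences]


# Toy corpus, invented for illustration only.
corpus = [
    ["cindy", "wood"],
    ["jean", "wood"],
    ["cindy", "pittsburgh"],
]

vocab = build_vocab(corpus, min_count=2)
mapped = map_to_unk(corpus, vocab)

# After mapping, <UNK> is counted like any other word, so it gets
# unigram and n-gram estimates "in the usual way".
unigram_counts = Counter(w for sent in mapped for w in sent)
print(sorted(vocab))
print(mapped)
print(unigram_counts[UNK])
```

    Because the rare words "jean" and "pittsburgh" are rewritten before
    counting, &lt;UNK&gt; ends up with real counts and therefore with
    ordinary entries in the trained model.<br>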

    <br>

    <blockquote

cite="mid:CACJWz0T1OKP2L5AnfPGNxnJ8_6-asD9KUu41TumVV1GRnoP+aQ@mail.gmail.com"

      type="cite">

      <div dir="ltr">

        <div><br>

        </div>

        <div>2. In the n-gram entry format:</div>

        <div><br>

        </div>

        <div><strong>             0.2 wood 0.5555</strong></div>

        <div><br>

        </div>

        <div>There are three elements: log10(P), the word (wood), and the
          backoff weight.</div>

        <div><br>

        </div>

        <div>How do we calculate the "<strong>backoff weights</strong>"
          (0.5555)?</div>

      </div>

    </blockquote>

    <br>

    See <a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Katz%27s_back-off_model">https://en.wikipedia.org/wiki/Katz%27s_back-off_model</a> or <br>

<a class="moz-txt-link-freetext" href="http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html">http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html</a><br>
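    <br>
    In short, the backoff weight for a history h is chosen so that the
    conditional distribution sums to one: bow(h) = (1 - sum of p(w|h)
    over words w seen after h) / (1 - sum of the lower-order p(w) over
    the same words). The Python sketch below works this out for a
    bigram model. The function names and the toy probabilities are
    assumptions made for illustration and are not taken from the file
    quoted above; the bigram probabilities are assumed to be already
    discounted. Note that an ARPA file stores log10 of both the
    probability and the backoff weight.<br>

```python
import math


def backoff_weight(history, bigram_probs, unigram_probs):
    """bow(h) = (1 - sum p(w|h)) / (1 - sum p(w)), over w seen after h."""
    seen = [w for (h, w) in bigram_probs if h == history]
    numerator = 1.0 - sum(bigram_probs[(history, w)] for w in seen)
    denominator = 1.0 - sum(unigram_probs[w] for w in seen)
    return numerator / denominator


def cond_prob(word, history, bigram_probs, unigram_probs):
    """Use the explicit bigram if present, otherwise back off to the unigram."""
    if (history, word) in bigram_probs:
        return bigram_probs[(history, word)]
    bow = backoff_weight(history, bigram_probs, unigram_probs)
    return bow * unigram_probs[word]


# Toy model in linear (not log) probabilities, invented for illustration.
unigram_probs = {"a": 0.5, "b": 0.3, "c": 0.2}
# Discounted bigram estimates for history "a"; they sum to 0.6,
# leaving 0.4 of the probability mass for unseen continuations.
bigram_probs = {("a", "b"): 0.4, ("a", "c"): 0.2}

bow = backoff_weight("a", bigram_probs, unigram_probs)
total = sum(cond_prob(w, "a", bigram_probs, unigram_probs) for w in unigram_probs)

print(bow)              # approximately 0.8
print(total)            # approximately 1.0
print(math.log10(bow))  # the form in which an ARPA file would store the weight
```

    The left-over mass (0.4) is spread over the words not seen after
    "a" in proportion to their unigram probabilities, which is why the
    denominator excludes the words already covered by explicit bigrams.<br>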

    <br>

    Andreas<br>

    <br>

  </body>

</html>