<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 7/5/2016 8:04 AM, Bey Youcef wrote:<br>
</div>
<blockquote
cite="mid:CACJWz0T1OKP2L5AnfPGNxnJ8_6-asD9KUu41TumVV1GRnoP+aQ@mail.gmail.com"
type="cite">
<div dir="ltr">
<div>Dear all,</div>
<div><br>
</div>
<div>I'm new to this community, and strongly interested in SMT
and NLP.</div>
<div><br>
</div>
<div>Here is an example:</div>
<div><a moz-do-not-send="true"
href="http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats">http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats</a></div>
<div>
<pre>\data\
ngram 1=7
ngram 2=7
\1-grams:
0.1 <UNK> 0.5555
0 <s> 0.4939
0.1 </s> 1.0
0.2 wood 0.5555
0.2 cindy 0.5555
0.2 pittsburgh 0.5555
0.2 jean 0.6349
\2-grams:
0.5555 <UNK> wood
0.5555 <s> <UNK>
0.5555 wood pittsburgh
0.5555 cindy jean
0.5555 pittsburgh cindy
0.2778 jean </s>
0.2778 jean wood
\end\</pre>
</div>
<div><br>
</div>
<div>Question:</div>
<div><br>
</div>
<div>1. Why does "UNK" exist in the ARPA file after training?</div>
<div><br>
</div>
<div>As far as I know, every entry in the model was seen at least
once in the training corpus. Hence, after training, the ARPA file
shouldn't contain UNK (unknown-word) entries.</div>
</div>
</blockquote>
People often include a <UNK> token as a placeholder for words
that might appear in new data but were not present in the training
data.<br>
To build such a model, you fix a model vocabulary that does not
contain all the words in the training corpus (you typically exclude
low-frequency words); the excluded words are replaced by <UNK>,
and its probability is then estimated in the usual way.<br>
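<br>
For example, here is a minimal preprocessing sketch in Python (the
min_count threshold of 2 is just an illustrative choice, and the
function name is mine; toolkits such as SRILM can also handle this
for you through their vocabulary options):<br>
<pre>from collections import Counter

def replace_rare_words(sentences, min_count=2):
    """Map words seen fewer than min_count times to <UNK>, so the
    language-model toolkit estimates its probability like that of
    any other vocabulary item."""
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {w for w, c in counts.items() if c >= min_count}
    return [[w if w in vocab else "<UNK>" for w in sent]
            for sent in sentences]</pre>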
<br>
<blockquote
cite="mid:CACJWz0T1OKP2L5AnfPGNxnJ8_6-asD9KUu41TumVV1GRnoP+aQ@mail.gmail.com"
type="cite">
<div dir="ltr">
<div><br>
</div>
<div>2. In the n-gram entry format</div>
<div><br>
</div>
<div><strong> 0.2 wood 0.5555</strong></div>
<div><br>
</div>
<div>There are three elements: log10(P), the word ("wood"), and the backoff weight.</div>
<div><br>
</div>
<div>How do we calculate the "<strong>backoff weight</strong>"
(0.5555)?</div>
</div>
</blockquote>
<br>
See <a class="moz-txt-link-freetext" href="https://en.wikipedia.org/wiki/Katz%27s_back-off_model">https://en.wikipedia.org/wiki/Katz%27s_back-off_model</a> or <br>
<a class="moz-txt-link-freetext" href="http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html">http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html</a><br>
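<br>
In short, for Katz back-off the weight alpha(h) of a history h is
the probability mass left over after discounting the seen
continuations of h, renormalized by the unigram mass of the unseen
ones; the ARPA file stores log10(alpha) as the third column. A
rough sketch in Python (it assumes you already have discounted
bigram probabilities, e.g. from Good-Turing discounting; the two
dictionaries are hypothetical inputs):<br>
<pre>import math

def backoff_log_weight(history, bigram_probs, unigram_probs):
    # bigram_probs:  {(h, w): discounted P*(w | h)}, one entry per
    #                bigram actually seen in training
    # unigram_probs: {w: P(w)} for every vocabulary word
    seen = [w for (h, w) in bigram_probs if h == history]
    left_over = 1.0 - sum(bigram_probs[(history, w)] for w in seen)
    unseen_mass = 1.0 - sum(unigram_probs[w] for w in seen)
    alpha = left_over / unseen_mass
    return math.log10(alpha)  # third column of the ARPA entry</pre>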
<br>
Andreas<br>
<br>
</body>
</html>