[SRILM User List] ARPA format
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Jul 5 10:25:37 PDT 2016
On 7/5/2016 8:04 AM, Bey Youcef wrote:
> Dear all,
>
> I'm new to this community and strongly interested in SMT and NLP.
>
> Here is an example:
> http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats
> \data\
> ngram 1=7
> ngram 2=7
>
> \1-grams:
> 0.1 <UNK> 0.5555
> 0 <s> 0.4939
> 0.1 </s> 1.0
> 0.2 wood 0.5555
> 0.2 cindy 0.5555
> 0.2 pittsburgh 0.5555
> 0.2 jean 0.6349
>
> \2-grams:
> 0.5555 <UNK> wood
> 0.5555 <s> <UNK>
> 0.5555 wood pittsburgh
> 0.5555 cindy jean
> 0.5555 pittsburgh cindy
> 0.2778 jean </s>
> 0.2778 jean wood
>
> \end\
>
> Question:
>
> 1. Why does "<UNK>" exist in the ARPA file after training?
>
> As far as I know, every word in the model occurs at least once in the
> training corpus. Hence, after training, the ARPA file shouldn't contain
> <UNK> (unknown words).
People often include a <UNK> token as a placeholder for words that you
might see in new data but that weren't present in the training data.
To build such a model you fix a model vocabulary that does not contain
all the words in the training corpus (you typically exclude
low-frequency words); those excluded words are replaced by <UNK>, and
its probability is estimated in the usual way.
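In SRILM this corresponds to passing a fixed word list to ngram-count via
-vocab and enabling -unk. The following Python sketch is purely
illustrative (not SRILM code); it only shows the idea of mapping
out-of-vocabulary training tokens to <UNK> before counting:

    from collections import Counter

    def replace_oov(tokens, vocab, unk="<UNK>"):
        """Map any token outside the fixed vocabulary to the <UNK> placeholder."""
        return [t if t in vocab else unk for t in tokens]

    # Illustrative training data and a fixed vocabulary that excludes "pittsburgh".
    train = ["<s>", "cindy", "jean", "wood", "pittsburgh", "</s>"]
    vocab = {"<s>", "</s>", "cindy", "jean", "wood"}

    mapped = replace_oov(train, vocab)
    # <UNK> now receives counts (and hence a probability) just like any other word.
    unigram_counts = Counter(mapped)
    total = sum(unigram_counts.values())
    print({w: c / total for w, c in unigram_counts.items()})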
>
> 2. In n-gram metrics format
>
> 0.2 wood 0.5555
>
> There are 3 elements: log10(P), the word ("wood"), and the backoff weight.
>
> How do we calculate the backoff weight (0.5555)?
See https://en.wikipedia.org/wiki/Katz%27s_back-off_model or
http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html
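In short, the backoff weight of a context word w1 redistributes the
probability mass left over after discounting: when the bigram (w1, w2) is
not in the model, P(w2|w1) = bow(w1) * P(w2). The weight is chosen so the
distribution over all following words sums to one, roughly
bow(w1) = (1 - sum of discounted P(w2|w1) over seen w2) / (1 - sum of P(w2)
over those same w2). A minimal lookup sketch in Python (the tables and
log-probabilities below are made up for illustration; this is not SRILM's API):

    # Illustrative ARPA-style tables: base-10 log probabilities and backoff weights.
    unigram_logprob = {"wood": -1.2, "jean": -0.9, "</s>": -1.0}
    unigram_bow     = {"wood": -0.3, "jean": -0.2}
    bigram_logprob  = {("jean", "wood"): -0.5}

    def bigram_prob(w1, w2):
        """Katz-style backoff: use the bigram entry if present, otherwise
        back off to the unigram scaled by w1's backoff weight."""
        if (w1, w2) in bigram_logprob:
            logp = bigram_logprob[(w1, w2)]
        else:
            # A missing backoff weight means log10(bow) = 0, i.e. weight 1.
            logp = unigram_bow.get(w1, 0.0) + unigram_logprob[w2]
        return 10 ** logp

    print(bigram_prob("jean", "wood"))   # explicit bigram entry
    print(bigram_prob("wood", "jean"))   # backed-off estimate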
Andreas