[SRILM User List] ARPA format
Bey Youcef
youcef.bey at gmail.com
Tue Jul 5 08:04:40 PDT 2016
Dear all,
I'm new to this community and strongly interested in SMT and NLP.
Here is an example:
http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats
\data\
ngram 1=7
ngram 2=7
\1-grams:
0.1 <UNK> 0.5555
0 <s> 0.4939
0.1 </s> 1.0
0.2 wood 0.5555
0.2 cindy 0.5555
0.2 pittsburgh 0.5555
0.2 jean 0.6349
\2-grams:
0.5555 <UNK> wood
0.5555 <s> <UNK>
0.5555 wood pittsburgh
0.5555 cindy jean
0.5555 pittsburgh cindy
0.2778 jean </s>
0.2778 jean wood
\end\
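To make the layout of the listing above concrete, here is a minimal sketch of a reader for this kind of file. The function name `parse_arpa` and the returned dictionary layout are made up for illustration and are not part of SRILM or Sphinx. It assumes whitespace-separated fields and treats the trailing backoff column as optional, since the highest-order n-grams do not carry one.

```python
def parse_arpa(text):
    """Parse ARPA-style text into {order: {ngram_tuple: (value, backoff_or_None)}}.

    Hypothetical helper for illustration only, not an SRILM or Sphinx API.
    """
    ngrams = {}
    order = None
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the header with the n-gram counts.
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        # Section markers look like "\1-grams:", "\2-grams:", ...
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            ngrams[order] = {}
            continue
        if order is None:
            continue
        fields = line.split()
        value = float(fields[0])                  # first column
        words = tuple(fields[1:1 + order])        # the n-gram itself
        # The backoff column is absent on the highest-order n-grams.
        backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[order][words] = (value, backoff)
    return ngrams


EXAMPLE = r"""\data\
ngram 1=7
ngram 2=7
\1-grams:
0.1 <UNK> 0.5555
0 <s> 0.4939
0.1 </s> 1.0
0.2 wood 0.5555
0.2 cindy 0.5555
0.2 pittsburgh 0.5555
0.2 jean 0.6349
\2-grams:
0.5555 <UNK> wood
0.5555 <s> <UNK>
0.5555 wood pittsburgh
0.5555 cindy jean
0.5555 pittsburgh cindy
0.2778 jean </s>
0.2778 jean wood
\end\
"""

model = parse_arpa(EXAMPLE)
print(model[1][("wood",)])
print(model[2][("jean", "wood")])
```

Each unigram entry comes back as a (value, backoff) pair, and each bigram entry has no backoff value because 2 is the highest order in this file.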
Questions:
1. Why does "<UNK>" appear in the ARPA file after training?
As far as I know, every word in the model occurs at least once in the training corpus.
Hence, after training, the ARPA file shouldn't contain <UNK> (unknown words).
2. About the format of an n-gram entry, for example:
0.2 wood 0.5555
There are 3 elements: log10(P), the word ("wood"), and the backoff weight.
How is the backoff weight (0.5555) calculated?
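For reference, here is a sketch of the usual normalization for back-off models, checked against the example. It assumes the numbers in the listing are plain probabilities rather than log10 values (they are all positive and at most 1). Under that assumption, the backoff weight of a history h is the probability mass left over after the bigrams seen with h, divided by the unigram mass not covered by those same words. Whether the tool that produced the example used exactly this formula is an assumption; the function name `backoff_weight` is made up for illustration.

```python
# Values copied from the example listing, read as plain probabilities
# (an assumption; a real ARPA file stores log10 values).
unigram = {
    "<UNK>": 0.1, "<s>": 0.0, "</s>": 0.1, "wood": 0.2,
    "cindy": 0.2, "pittsburgh": 0.2, "jean": 0.2,
}
bigram = {
    ("<UNK>", "wood"): 0.5555,
    ("<s>", "<UNK>"): 0.5555,
    ("wood", "pittsburgh"): 0.5555,
    ("cindy", "jean"): 0.5555,
    ("pittsburgh", "cindy"): 0.5555,
    ("jean", "</s>"): 0.2778,
    ("jean", "wood"): 0.2778,
}


def backoff_weight(history):
    """Leftover bigram mass for `history`, renormalized over unseen words."""
    seen = [w for (h, w) in bigram if h == history]
    # Probability mass not assigned to bigrams that start with `history`.
    left_higher = 1.0 - sum(bigram[(history, w)] for w in seen)
    # Unigram mass of the words that were NOT seen after `history`.
    left_lower = 1.0 - sum(unigram[w] for w in seen)
    return left_higher / left_lower


for h in unigram:
    print(h, round(backoff_weight(h), 4))
```

With these inputs the computed weights agree with the third column of the 1-grams section to within rounding; for example "jean" gives (1 - 0.2778 - 0.2778) / (1 - 0.1 - 0.2), which is about 0.6349.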
Thanks so much
Joseph.