[SRILM User List] ARPA format

Bey Youcef youcef.bey at gmail.com
Tue Jul 5 08:04:40 PDT 2016


Dear all,

I'm new to this community and strongly interested in SMT and NLP.

Here is an example ARPA file, taken from:
http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats


\data\
ngram 1=7
ngram 2=7

\1-grams:
0.1 <UNK>	0.5555
0 <s>	 0.4939
0.1 </s>	 1.0
0.2 wood	 0.5555
0.2 cindy	0.5555
0.2 pittsburgh		0.5555
0.2 jean	 0.6349

\2-grams:
0.5555 <UNK> wood
0.5555 <s> <UNK>
0.5555 wood pittsburgh
0.5555 cindy jean
0.5555 pittsburgh cindy
0.2778 jean </s>
0.2778 jean wood

\end\
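
To make the layout concrete, here is a minimal Python sketch of how I read such a file. It is only my own illustration, not SRILM or Sphinx code: the function name parse_arpa and the SAMPLE string are hypothetical, and I assume each entry line is "value, n words, optional trailing value" separated by whitespace.

```python
from collections import defaultdict

# A shortened copy of the listing above, used only as test input.
SAMPLE = r"""
\data\
ngram 1=2
ngram 2=1

\1-grams:
0.2 wood	0.5555
0.2 jean	 0.6349

\2-grams:
0.2778 jean wood

\end\
"""


def parse_arpa(text):
    """Return {order: {word_tuple: (first_value, trailing_value_or_None)}}."""
    ngrams = defaultdict(dict)
    order = None  # order of the section currently being read
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and the count header.
        if not line or line == "\\data\\" or line.startswith("ngram "):
            continue
        if line == "\\end\\":
            break
        # Section header such as "\1-grams:" or "\2-grams:".
        if line.startswith("\\") and line.endswith("-grams:"):
            order = int(line[1:line.index("-")])
            continue
        if order is None:
            continue  # ignore anything before the first section
        fields = line.split()  # handles mixed tabs and spaces
        value = float(fields[0])
        words = tuple(fields[1:1 + order])
        # The trailing (backoff) value is optional, e.g. on the highest order.
        trailing = float(fields[1 + order]) if len(fields) > 1 + order else None
        ngrams[order][words] = (value, trailing)
    return ngrams


model = parse_arpa(SAMPLE)
print(model[1][("wood",)])
print(model[2][("jean", "wood")])
```

Splitting on any whitespace is a deliberate choice here, since the listing mixes tabs and spaces between the fields.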


Questions:

1. Why does "<UNK>" exist in the ARPA file after training?

As far as I know, every word in the model occurs at least once in the training corpus.
Hence, after training, the ARPA file shouldn't contain <UNK> (unknown words).

2. About the format of an n-gram entry, for example:

    0.2 wood 0.5555

There are 3 elements: log10(P), the word (wood), and the backoff weight.
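
My understanding of how the third element is used (please correct me if this is wrong) is the usual backoff lookup: if a bigram is listed, its own value is used; otherwise the backoff weight of the first word is added to the log10 probability of the second word. A small Python sketch of that assumption follows. The function name and the numbers are made up for illustration (they are not taken from the file above), and I assume both fields are stored as log10 values.

```python
# Hypothetical toy model: {word: (log10_prob, log10_backoff_weight)}
UNIGRAMS = {
    "jean": (-0.5, -0.25),
    "wood": (-0.5, -0.25),
    "cindy": (-0.75, -0.25),
}
# Hypothetical toy model: {(w1, w2): log10_prob}
BIGRAMS = {("jean", "wood"): -0.5}


def bigram_log10_prob(unigrams, bigrams, w1, w2):
    """Backoff lookup for log10 P(w2 | w1) under the assumptions stated above."""
    if (w1, w2) in bigrams:
        # The bigram is listed explicitly: use its own value.
        return bigrams[(w1, w2)]
    # Otherwise back off: backoff weight of the history plus unigram log-prob.
    # A history without an entry contributes a backoff weight of 0.0 (log10 of 1).
    backoff = unigrams[w1][1] if w1 in unigrams else 0.0
    return backoff + unigrams[w2][0]


print(bigram_log10_prob(UNIGRAMS, BIGRAMS, "jean", "wood"))   # listed bigram
print(bigram_log10_prob(UNIGRAMS, BIGRAMS, "jean", "cindy"))  # backed-off bigram
```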

How do we calculate the backoff weight (0.5555)?

Thanks so much

Joseph.