[SRILM User List] ARPA format
Andreas Stolcke
stolcke at icsi.berkeley.edu
Tue Jul 5 10:25:37 PDT 2016
On 7/5/2016 8:04 AM, Bey Youcef wrote:
> Dear all,
>
> I'm new to this community and strongly interested in SMT and NLP.
>
> Here is an example:
> http://cmusphinx.sourceforge.net/wiki/sphinx4:standardgrammarformats
> \data\
> ngram 1=7
> ngram 2=7
>
> \1-grams:
> 0.1 <UNK> 0.5555
> 0 <s> 0.4939
> 0.1 </s> 1.0
> 0.2 wood 0.5555
> 0.2 cindy 0.5555
> 0.2 pittsburgh 0.5555
> 0.2 jean 0.6349
>
> \2-grams:
> 0.5555 <UNK> wood
> 0.5555 <s> <UNK>
> 0.5555 wood pittsburgh
> 0.5555 cindy jean
> 0.5555 pittsburgh cindy
> 0.2778 jean </s>
> 0.2778 jean wood
>
> \end\
>
> Question:
>
> 1. Why does "<UNK>" exist in the ARPA file after training?
>
> As far as I know, every word in the model occurs at least once in the
> training corpus. Hence, after training, the ARPA file shouldn't contain
> <UNK> (unknown words).
People often include a <UNK> token as a placeholder for words that you
might see in new data but that weren't present in the training data.
To build such a model you fix a model vocabulary that does not contain
all the words in the training corpus (you typically exclude
low-frequency words); those excluded words are replaced by <UNK>, and
its probability is estimated in the usual way.
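In SRILM this corresponds to passing a fixed word list to ngram-count via
-vocab and enabling -unk. The following Python sketch is purely
illustrative (not SRILM code); it only shows the idea of mapping
out-of-vocabulary training tokens to <UNK> before counting:

    from collections import Counter

    def replace_oov(tokens, vocab, unk="<UNK>"):
        """Map any token outside the fixed vocabulary to the <UNK> placeholder."""
        return [t if t in vocab else unk for t in tokens]

    # Illustrative training data and a fixed vocabulary that excludes "pittsburgh".
    train = ["<s>", "cindy", "jean", "wood", "pittsburgh", "</s>"]
    vocab = {"<s>", "</s>", "cindy", "jean", "wood"}

    mapped = replace_oov(train, vocab)
    # <UNK> now receives counts (and hence a probability) just like any other word.
    unigram_counts = Counter(mapped)
    total = sum(unigram_counts.values())
    print({w: c / total for w, c in unigram_counts.items()})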
>
> 2. In n-gram metrics format
>
> 0.2 wood 0.5555
>
> There are 3 elements: log10(P), the word ("wood"), and the backoff weight.
>
> How do we calculate the backoff weight (0.5555)?
See https://en.wikipedia.org/wiki/Katz%27s_back-off_model or
http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html
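In short, the backoff weight of a context word w1 redistributes the
probability mass left over after discounting: when the bigram (w1, w2) is
not in the model, P(w2|w1) = bow(w1) * P(w2). The weight is chosen so the
distribution over all following words sums to one, roughly
bow(w1) = (1 - sum of discounted P(w2|w1) over seen w2) / (1 - sum of P(w2)
over those same w2). A minimal lookup sketch in Python (the tables and
log-probabilities below are made up for illustration; this is not SRILM's API):

    # Illustrative ARPA-style tables: base-10 log probabilities and backoff weights.
    unigram_logprob = {"wood": -1.2, "jean": -0.9, "</s>": -1.0}
    unigram_bow     = {"wood": -0.3, "jean": -0.2}
    bigram_logprob  = {("jean", "wood"): -0.5}

    def bigram_prob(w1, w2):
        """Katz-style backoff: use the bigram entry if present, otherwise
        back off to the unigram scaled by w1's backoff weight."""
        if (w1, w2) in bigram_logprob:
            logp = bigram_logprob[(w1, w2)]
        else:
            # A missing backoff weight means log10(bow) = 0, i.e. weight 1.
            logp = unigram_bow.get(w1, 0.0) + unigram_logprob[w2]
        return 10 ** logp

    print(bigram_prob("jean", "wood"))   # explicit bigram entry
    print(bigram_prob("wood", "jean"))   # backed-off estimate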
Andreas