[SRILM User List] ARPA format
stolcke at icsi.berkeley.edu
Wed Jul 6 08:44:24 PDT 2016
On 7/6/2016 4:57 AM, Bey Youcef wrote:
> Thank you very much for your answer.
> Do you mean that before training, we should have a corpus (T) and
> vocabulary (VOC); and replace absent words by UNK in the training
> corpus? (I thought VOC is made from T by 1-gram)
> In this case, how about unseen words that don't belong to VOC during
> the evaluation ? Should we replace them by UNK and take the
> probability already computed in the Model?
Both of these substitutions happen automatically in SRILM when you
specify the vocabulary with -vocab and also use the -unk option.
Other tools may do it differently. Note: SRILM uses <unk> as the
unknown-word token, not UNK.
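As a sketch of how that looks on the command line (file names here are hypothetical; the -vocab, -unk, -text, -lm, and -ppl options are the SRILM ones discussed above):

```shell
# Train: any word in train.txt that is not in vocab.txt is mapped to <unk>
ngram-count -order 3 -vocab vocab.txt -unk -text train.txt -lm lm.arpa

# Evaluate: unseen test words are likewise scored via the <unk> probability
ngram -order 3 -lm lm.arpa -unk -ppl test.txt
```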
> What then is smoothing for?
Smoothing is primarily for allowing unseen ngrams (not just unigrams).
For example, even though "mondays" occurred in the training data, you
might not have seen the ngram "i like mondays". Smoothing removes some
probability from all the observed ngrams "i like ..." and gives it to
unseen ngrams that start with "i like".
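A minimal sketch of that idea, using add-one (Laplace) smoothing over toy counts of words following "i like" (the counts and vocabulary are made up for illustration; real LMs use more refined schemes such as Kneser-Ney):

```python
from collections import Counter

# Hypothetical counts of words observed after the context "i like"
counts = Counter({"cats": 3, "tea": 1})
vocab = ["cats", "tea", "mondays"]  # "mondays" never followed "i like"

def mle(word):
    # Unsmoothed maximum-likelihood estimate: unseen ngrams get zero
    return counts[word] / sum(counts.values())

def add_one(word):
    # Add-one smoothing: shave some mass off the observed ngrams and
    # spread it over every word in the vocabulary
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

print(mle("mondays"))      # 0.0 -- the raw counts rule it out entirely
print(add_one("mondays"))  # nonzero, taken from the seen continuations
```

The smoothed distribution still sums to one; the probability given to "i like mondays" comes out of what "i like cats" and "i like tea" had before.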