[SRILM User List] ARPA format
stolcke at icsi.berkeley.edu
Wed Jul 6 08:44:24 PDT 2016
On 7/6/2016 4:57 AM, Bey Youcef wrote:
> Thank you very much for your answer.
> Do you mean that before training, we should have a corpus (T) and
> vocabulary (VOC); and replace absent words by UNK in the training
> corpus? (I thought VOC is made from T by 1-gram)
> In this case, how about unseen words that don't belong to VOC during
> the evaluation ? Should we replace them by UNK and take the
> probability already computed in the Model?
Both of these substitutions happen automatically in SRILM when you
specify the vocabulary with -vocab and also use the -unk option.
Other tools may do it differently. Note: SRILM uses <unk> as the
unknown-word token, not UNK.
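As a sketch of how that looks on the command line (file names here are hypothetical; the -vocab, -unk, -text, -lm, and -ppl options are the SRILM ones discussed above):

```shell
# Train: any word in train.txt that is not in vocab.txt is mapped to <unk>
ngram-count -order 3 -vocab vocab.txt -unk -text train.txt -lm lm.arpa

# Evaluate: unseen test words are likewise scored via the <unk> probability
ngram -order 3 -lm lm.arpa -unk -ppl test.txt
```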
> What then is smoothing for?
Smoothing is primarily for allowing unseen ngrams (not just unigrams).
For example, even though "mondays" occurred in the training data, you
might not have seen the ngram "i like mondays". Smoothing removes some
probability from all the observed ngrams "i like ..." and gives it to
unseen ngrams that start with "i like".
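A minimal sketch of that idea, using add-one (Laplace) smoothing over toy counts of words following "i like" (the counts and vocabulary are made up for illustration; real LMs use more refined schemes such as Kneser-Ney):

```python
from collections import Counter

# Hypothetical counts of words observed after the context "i like"
counts = Counter({"cats": 3, "tea": 1})
vocab = ["cats", "tea", "mondays"]  # "mondays" never followed "i like"

def mle(word):
    # Unsmoothed maximum-likelihood estimate: unseen ngrams get zero
    return counts[word] / sum(counts.values())

def add_one(word):
    # Add-one smoothing: shave some mass off the observed ngrams and
    # spread it over every word in the vocabulary
    return (counts[word] + 1) / (sum(counts.values()) + len(vocab))

print(mle("mondays"))      # 0.0 -- the raw counts rule it out entirely
print(add_one("mondays"))  # nonzero, taken from the seen continuations
```

The smoothed distribution still sums to one; the probability given to "i like mondays" comes out of what "i like cats" and "i like tea" had before.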