[SRILM User List] ARPA format
Ana
amontalvo at cenatav.co.cu
Wed Oct 19 13:27:17 PDT 2016
Hi all,
Something changes when I add the -vocab option. Is this the expected
behavior, and are both LMs correct?
With -vocab the probabilities are all nearly equal; without -vocab they
vary more, and the 1-grams have an additional probability column.
Please take a look below and comment.
Best regards,
Ana
*without -vocab*
\data\
ngram 1=10819
ngram 2=58565
\1-grams:
-4.879262 . -0.3009124
-1.284759 </s>
-99 <s> -0.5989256
-1.722562 A -0.4924272
-3.040413 A. -0.4656199
-4.578232 A.'S -0.2988251
-4.879262 A.S -0.2973903
-4.335194 ABANDON -0.3181008
-4.335194 ABANDONED -0.4768775
-4.402141 ABANDONING -0.535318
-4.703171 ABBOUD -0.3001948
-4.879262 ABBREVIATED -0.3008665
-4.879262 ABERRATION -0.2933786
*using -vocab*
\data\
ngram 1=237764
ngram 2=55267
\1-grams:
-6.536696 !EXCLAMATION-POINT
-6.536696 "DOUBLE-QUOTE
-6.536696 %PERCENT
-6.536696 &AMPERSAND
-6.536696 &EM
-6.536696 &FLU
-6.536696 &NEATH
-6.536696 &SBLOOD
-6.536696 &SDEATH
-6.536696 &TIS
-6.536696 &TWAS
-6.536696 &TWEEN
-6.536696 &TWERE
-6.536696 &TWIXT
-6.536696 'AVE
-6.536696 'CAUSE
-6.536696 'COS
-6.536696 'EM
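
For readers comparing the two listings: in the ARPA format each n-gram line holds a log10 probability, the n-gram itself, and an optional backoff weight (also log10). The extra column in the listing without -vocab is that backoff weight. The words shown in the -vocab listing carry none, presumably because they never start an observed bigram, and their identical value is consistent with vocabulary words that were not seen in training sharing the leftover unigram mass equally. The following is a minimal Python sketch, not SRILM code; the function name parse_arpa_line is hypothetical and it assumes whitespace-separated fields as they appear in this post.

```python
def parse_arpa_line(line, order):
    """Split one ARPA n-gram line into (log10 prob, words, backoff or None).

    `order` is the n-gram order of the section the line comes from
    (1 for the \\1-grams: section). Hypothetical helper for illustration.
    """
    fields = line.split()
    if len(fields) not in (order + 1, order + 2):
        raise ValueError("unexpected number of fields: %r" % line)
    logprob = float(fields[0])
    words = tuple(fields[1:1 + order])
    # The backoff weight is optional; it is the last field when present.
    backoff = float(fields[order + 1]) if len(fields) == order + 2 else None
    return logprob, words, backoff


# Lines copied from the two listings above.
with_backoff = parse_arpa_line("-1.722562 A -0.4924272", order=1)
no_backoff = parse_arpa_line("-6.536696 %PERCENT", order=1)

# Convert the shared log10 value back to a plain probability.
flat_prob = 10 ** no_backoff[0]
print(with_backoff, no_backoff, flat_prob)
```

The converted value is on the order of 3e-7, i.e. a very small probability assigned identically to each of those vocabulary entries.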
On 06/07/16 11:44, Andreas Stolcke wrote:
> On 7/6/2016 4:57 AM, Bey Youcef wrote:
>>
>> Thank you very much for your answer.
>>
>> Do you mean that before training, we should have a corpus (T) and
>> vocabulary (VOC); and replace absent words by UNK in the training
>> corpus? (I thought VOC is made from T by 1-gram)
> Yes
>>
>> In this case, how about unseen words that don't belong to VOC during
>> the evaluation ? Should we replace them by UNK and take the
>> probability already computed in the Model?
> Yes
>
> Both of these substitutions happen automatically in SRILM when you
> specify the vocabulary with -vocab and also use the -unk option.
> Other tools may do it differently. Note: SRILM uses <unk> instead
> of <UNK>.
>
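
The substitution described above can be illustrated with a small Python sketch. This is not SRILM's implementation, only an assumption-laden illustration of the idea: any token outside the given vocabulary is replaced by <unk>, both in the training text and at evaluation time. The function name map_oov and the toy vocabulary are hypothetical.

```python
UNK = "<unk>"  # SRILM's spelling of the unknown-word token


def map_oov(tokens, vocab):
    """Replace every token that is not in `vocab` with the <unk> token."""
    return [tok if tok in vocab else UNK for tok in tokens]


# Toy vocabulary and sentences (made up for illustration).
vocab = {"i", "like", "mondays"}
train_sentence = "i like fridays".split()
test_sentence = "i like tuesdays".split()

mapped_train = map_oov(train_sentence, vocab)
mapped_test = map_oov(test_sentence, vocab)
print(mapped_train)
print(mapped_test)
```

Because both unseen words map to the same token, the probability estimated for <unk> during training is the one used for any out-of-vocabulary word at evaluation time.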
>>
>> What then is smoothing for?
> Smoothing is primarily for allowing unseen ngrams (not just
> unigrams). For example, even though "mondays" occurred in the
> training data you might not have seen the ngram "i like mondays".
> Smoothing removes some probability from all the observed ngrams "i
> like ..." and gives it to unseen ngrams that start with "i like".
>
> Andreas
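
To make the smoothing explanation concrete, here is a minimal Python sketch using absolute discounting, chosen only because it is easy to follow; it is not necessarily the method SRILM applies in a given setup. The counts for words following "i like" are invented, and the function name absolute_discount and the discount value 0.5 are assumptions.

```python
def absolute_discount(counts, discount=0.5):
    """Subtract a fixed discount from each observed count.

    Returns the discounted probabilities of the observed continuations and
    the probability mass reserved for unseen continuations.
    """
    total = sum(counts.values())
    seen = {w: (c - discount) / total for w, c in counts.items()}
    # Each observed word type gave up `discount` counts.
    reserved = discount * len(counts) / total
    return seen, reserved


# Invented counts of words observed after the context "i like".
counts_after_i_like = {"fridays": 3, "coffee": 1}

seen, reserved = absolute_discount(counts_after_i_like)
print(seen)      # discounted probabilities of the observed trigrams
print(reserved)  # mass left for unseen trigrams such as "i like mondays"
```

In a full backoff model the reserved mass would be spread over unseen continuations in proportion to the lower-order (bigram or unigram) probabilities, which is where the backoff weights in an ARPA file come in.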