[SRILM User List] ARPA format

Ana amontalvo at cenatav.co.cu
Wed Oct 19 13:27:17 PDT 2016


Hi all,

Something changes when I add the -vocab option. Is this the correct 
behavior, and are both LMs correct?

With -vocab all the probabilities are nearly equal; without -vocab they 
vary more, and most of the 1-grams have an additional column of values...

Please take a look below and comment.

best regards

ana


*without -vocab*

\data\
ngram 1=10819
ngram 2=58565

\1-grams:
-4.879262    .    -0.3009124
-1.284759    </s>
-99    <s>    -0.5989256
-1.722562    A    -0.4924272
-3.040413    A.    -0.4656199
-4.578232    A.'S    -0.2988251
-4.879262    A.S    -0.2973903
-4.335194    ABANDON    -0.3181008
-4.335194    ABANDONED    -0.4768775
-4.402141    ABANDONING    -0.535318
-4.703171    ABBOUD    -0.3001948
-4.879262    ABBREVIATED    -0.3008665
-4.879262    ABERRATION    -0.2933786

*using -vocab*

\data\
ngram 1=237764
ngram 2=55267

\1-grams:
-6.536696    !EXCLAMATION-POINT
-6.536696    "DOUBLE-QUOTE
-6.536696    %PERCENT
-6.536696    &AMPERSAND
-6.536696    &EM
-6.536696    &FLU
-6.536696    &NEATH
-6.536696    &SBLOOD
-6.536696    &SDEATH
-6.536696    &TIS
-6.536696    &TWAS
-6.536696    &TWEEN
-6.536696    &TWERE
-6.536696    &TWIXT
-6.536696    'AVE
-6.536696    'CAUSE
-6.536696    'COS
-6.536696    'EM


On 06/07/16 11:44, Andreas Stolcke wrote:
> On 7/6/2016 4:57 AM, Bey Youcef wrote:
>>
>> Thank you very much for your answer.
>>
>> Do you mean that before training, we should have a corpus (T) and 
>> vocabulary (VOC); and replace absent words by UNK in the training 
>> corpus? (I thought VOC is made from T by 1-gram)
> Yes
>>
>> In this case, how about unseen words that don't belong to VOC during 
>> the evaluation ? Should we replace them by UNK and take the 
>> probability already computed in the Model?
> Yes
>
> Both of these substitutions happen automatically in SRILM when you 
> specify the vocabulary with -vocab and also use the -unk option.
> Other tools may do it differently.   Note:  SRILM uses <unk> instead 
> of <UNK>.
>
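
The substitution described in the quoted reply can be pictured with a small
sketch. This is only a conceptual illustration, not SRILM's code: map_oov is
a hypothetical helper, and the toy vocabulary is invented for the example.

```python
def map_oov(tokens, vocab, unk="<unk>"):
    """Replace every token that is not in the vocabulary with the unknown-word tag."""
    return [tok if tok in vocab else unk for tok in tokens]


# Toy vocabulary (an assumption for this example only).
vocab = {"i", "like", "mondays"}

# The same mapping would apply to training text and to evaluation text.
print(map_oov("i like fridays".split(), vocab))  # ['i', 'like', '<unk>']
```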
>>
>> What then is smoothing for?
> Smoothing is primarily for allowing unseen ngrams (not just 
> unigrams).   For example, even though "mondays" occurred in the 
> training data you might not have seen the ngram "i like mondays". 
> Smoothing removes some probability from all the observed ngrams "i 
> like ..."  and gives it to unseen ngrams that start with "i like".
>
> Andreas
>
>
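
The smoothing idea in the quoted reply (take some probability from the
observed continuations of "i like" and give it to the unseen ones) can be
sketched as below. This is a toy absolute-discounting scheme with a uniform
split over unseen words, chosen only because it is short; it is not claimed
to be the method SRILM applies, and the counts, the vocabulary and the
function name discounted_probs are invented for the example.

```python
def discounted_probs(counts, vocab, discount=0.5):
    """Toy smoothing for one context.

    counts: observed continuation counts for the context (all >= 1,
            all words contained in vocab).
    vocab:  list of every word that may follow the context.
    """
    total = sum(counts.values())
    # Each observed word gives up `discount` counts.
    seen = {w: (c - discount) / total for w, c in counts.items()}
    # The mass removed from the observed words.
    freed = discount * len(counts) / total
    unseen = [w for w in vocab if w not in counts]
    # Split the freed mass evenly over unseen words (a real backoff model
    # would use the lower-order distribution instead of a uniform split).
    share = freed / len(unseen) if unseen else 0.0
    return {w: seen.get(w, share) for w in vocab}


# Invented counts of words seen after "i like".
counts = {"fridays": 3, "coffee": 1}
vocab = ["fridays", "coffee", "mondays", "tea"]

probs = discounted_probs(counts, vocab)
for word in vocab:
    print(word, probs[word])
```

Here "mondays" was never seen after "i like", yet it ends up with a nonzero
probability, paid for by the observed words.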
