[SRILM User List] ARPA format

Wed Oct 19 16:42:29 PDT 2016

On 10/19/2016 1:27 PM, Ana wrote:
>
> Hi all,
>
> something happens when I add the -vocab option. I wonder if this is 
> correct behavior and if both LMs are correct?
>
> with -vocab all probabilities are nearly equal, and without -vocab they 
> vary more, and for 1-grams there is another column of numbers...
>
> Please take a look below and comment
>
> best regards
>
> ana
>
Ana,

With -vocab you force the LM to use the vocabulary specified in the word 
list you give.  Without -vocab, the vocabulary consists only of the 
words found in the training data.
In your example, your specified vocabulary contains 237764 word types, 
but your training data seems to have only 10819 word types, far fewer.
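
The gap between the two vocabularies can be checked outside SRILM. Below is a minimal Python sketch using a made-up toy word list and corpus; the names word_types, vocab_list and training are hypothetical and the data is invented for illustration only.

```python
def word_types(sentences):
    """Return the set of distinct whitespace-separated tokens."""
    return {w for s in sentences for w in s.split()}

# Toy stand-in for the word list passed via -vocab (invented data).
vocab_list = {"A", "ABANDON", "ABANDONED", "&TWAS", "'CAUSE"}
# Toy stand-in for the training text (invented data).
training = ["A ABANDONED", "A ABANDON A"]

seen = word_types(training)
# Words that are in the specified vocabulary but never occur in training.
unseen_in_training = vocab_list - seen

print(len(vocab_list), len(seen), sorted(unseen_in_training))
```

With -vocab, words like the ones in unseen_in_training still get unigram entries, which is why the unigram count in the second LM equals the size of the word list.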

As to the extra column of numbers: those are the backoff weights, which 
appear after the ngrams in the LM file.  With -vocab, the majority of 
words do not occur in the training set.  Therefore, there won't be any 
bigrams containing those extra words, and therefore the LM contains no 
backoff weights for those extra words.
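
To make the column layout concrete, here is a minimal Python sketch (not SRILM code) that splits one ngram line into its log10 probability, its words, and the optional backoff weight. The helper name parse_arpa_line is hypothetical, and the sketch assumes the caller knows the ngram order of the section being read.

```python
from typing import List, Optional, Tuple

def parse_arpa_line(line: str, order: int) -> Tuple[float, List[str], Optional[float]]:
    """Split one ngram line of an ARPA file (hypothetical helper).

    Layout: log10 probability, `order` words, then an optional backoff weight.
    """
    fields = line.split()
    logprob = float(fields[0])
    words = fields[1:1 + order]
    # A backoff weight is present only if a field follows the words.
    backoff = float(fields[1 + order]) if len(fields) > 1 + order else None
    return logprob, words, backoff

# Unigram lines copied from the two LMs quoted in this thread.
print(parse_arpa_line("-1.722562    A    -0.4924272", 1))  # has a backoff weight
print(parse_arpa_line("-6.536696    &TWAS", 1))            # no backoff weight
```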

For more information on how backoff works in ngram LMs, see this page 
<http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html>.
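
As a rough illustration of how the backoff weights are used, here is a toy Python sketch of a standard backoff bigram lookup. It is not SRILM's implementation: the names are hypothetical, the single bigram entry is invented, and a context with no backoff weight is assumed to contribute 0 in log10 space.

```python
from typing import Dict, Tuple

# Unigram entries: word -> (log10 probability, log10 backoff weight).
# These two entries are copied from the LM without -vocab quoted in this thread.
unigrams: Dict[str, Tuple[float, float]] = {
    "A": (-1.722562, -0.4924272),
    "ABANDON": (-4.335194, -0.3181008),
}

# Bigram entry invented purely for illustration.
bigrams: Dict[Tuple[str, str], float] = {("A", "A"): -2.5}

def bigram_logprob(w1: str, w2: str) -> float:
    """Return log10 P(w2 | w1) under a backoff bigram model (toy sketch)."""
    if (w1, w2) in bigrams:
        # The bigram was observed: use its own probability.
        return bigrams[(w1, w2)]
    # Unseen bigram: backoff weight of the context plus the unigram log probability.
    # Assumption: a context with no listed backoff weight contributes 0.0.
    bow = unigrams[w1][1] if w1 in unigrams else 0.0
    return bow + unigrams[w2][0]

print(bigram_logprob("A", "A"))        # listed bigram
print(bigram_logprob("A", "ABANDON"))  # backed-off estimate
```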

Andreas

>
> *without -vocab*
>
> \data\
> ngram 1=10819
> ngram 2=58565
>
> \1-grams:
> -4.879262    .    -0.3009124
> -1.284759    </s>
> -99    <s>    -0.5989256
> -1.722562    A    -0.4924272
> -3.040413    A.    -0.4656199
> -4.578232    A.'S    -0.2988251
> -4.879262    A.S    -0.2973903
> -4.335194    ABANDON    -0.3181008
> -4.335194    ABANDONED    -0.4768775
> -4.402141    ABANDONING    -0.535318
> -4.703171    ABBOUD    -0.3001948
> -4.879262    ABBREVIATED    -0.3008665
> -4.879262    ABERRATION    -0.2933786
>
> *using -vocab*
>
> \data\
> ngram 1=237764
> ngram 2=55267
>
> \1-grams:
> -6.536696    !EXCLAMATION-POINT
> -6.536696    "DOUBLE-QUOTE
> -6.536696    %PERCENT
> -6.536696    &AMPERSAND
> -6.536696    &EM
> -6.536696    &FLU
> -6.536696    &NEATH
> -6.536696    &SBLOOD
> -6.536696    &SDEATH
> -6.536696    &TIS
> -6.536696    &TWAS
> -6.536696    &TWEEN
> -6.536696    &TWERE
> -6.536696    &TWIXT
> -6.536696    'AVE
> -6.536696    'CAUSE
> -6.536696    'COS
> -6.536696    'EM
>
>
> On 06/07/16 11:44, Andreas Stolcke wrote:
>> On 7/6/2016 4:57 AM, Bey Youcef wrote:
>>>
>>> Thank you very much for your answer.
>>>
>>> Do you mean that before training, we should have a corpus (T) and 
>>> vocabulary (VOC); and replace absent words by UNK in the training 
>>> corpus? (I thought VOC is made from T by 1-gram)
>> Yes
>>>
>>> In this case, how about unseen words that don't belong to VOC during 
>>> the evaluation ? Should we replace them by UNK and take the 
>>> probability already computed in the Model?
>> Yes
>>
>> Both of these substitutions happen automatically in SRILM when you 
>> specify the vocabulary with -vocab and also use the -unk option.
>> Other tools may do it differently.   Note:  SRILM uses <unk> instead 
>> of <UNK>.
>>
>>>
>>> What then is smoothing for?
>> Smoothing is primarily for allowing unseen ngrams (not just 
>> unigrams).   For example, even though "mondays" occurred in the 
>> training data you might not have seen the ngram "i like mondays". 
>> Smoothing removes some probability from all the observed ngrams "i 
>> like ..."  and gives it to unseen ngrams that start with "i like".
>>
>> Andreas
>>
>>
>> _______________________________________________
>> SRILM-User site list
>> SRILM-User at speech.sri.com
>> http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user
>
