<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<div class="moz-cite-prefix">On 10/19/2016 1:27 PM, Ana wrote:<br>
</div>
<blockquote
cite="mid:8fc8e458-c623-d7b5-ab0d-be623e8785b8@cenatav.co.cu"
type="cite">
<p>Hi all,</p>
<p>something happens when I add the -vocab option; I wonder if
this is correct behavior and whether both LMs are correct.</p>
<p>with -vocab all the probabilities are pretty much equal, while
without -vocab they vary more, and the 1-grams have an extra
probability column...</p>
<p>Please take a look below and comment.<br>
</p>
<p>best regards</p>
<p>ana</p>
</blockquote>
Ana,<br>
<br>
With -vocab you force the LM to use the vocabulary specified in the
word list you give. Without -vocab, the vocabulary consists only of
the words found in the training data.<br>
In your example, the specified vocabulary contains 237764 word
types, but your training data seems to contain only 10819 word
types, far fewer.<br>
<br>
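To make the two cases concrete, here is a minimal sketch of the
corresponding commands (file names are hypothetical; -order, -text,
-vocab, and -lm are standard ngram-count options):<br>
<pre wrap=""># vocabulary = whatever word types appear in train.txt
ngram-count -order 2 -text train.txt -lm noVocab.lm

# vocabulary fixed by wordlist.txt; words never seen in train.txt
# still receive unigram entries in withVocab.lm
ngram-count -order 2 -text train.txt -vocab wordlist.txt -lm withVocab.lm</pre>
<br>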
As to the extra column of numbers: with -vocab, the majority of
words do not occur in the training set. There are therefore no
bigrams containing those extra words, so the LM contains no
backoff weights for them. The backoff weights are the numbers
you see after the ngrams in the LM file.<br>
<br>
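Concretely, each line in the \1-grams section has the form
"log10(probability) word log10(backoff weight)", with the last field
present only when the word starts at least one observed bigram. Two
lines from your own outputs, annotated here for illustration:<br>
<pre wrap="">-1.722562  A                   -0.4924272   # seen in training; bigrams "A ..." exist
-6.536696  !EXCLAMATION-POINT               # -vocab-only word; no bigrams, no backoff weight</pre>
When a bigram (v, w) is not listed in the model, its probability is
computed by backing off: log10 P(w|v) = bow(v) + log10 P(w), where
bow(v) is the backoff weight stored on v's unigram line.<br>
<br>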
For more information on how backoff works in ngram LMs, see <a
href="http://www.speech.sri.com/projects/srilm/manpages/ngram-discount.7.html">this
page</a>.<br>
<br>
Andreas<br>
<br>
<blockquote
cite="mid:8fc8e458-c623-d7b5-ab0d-be623e8785b8@cenatav.co.cu"
type="cite">
<p><br>
</p>
<p><b>without -vocab</b></p>
<p>\data\<br>
ngram 1=10819<br>
ngram 2=58565<br>
<br>
\1-grams:<br>
-4.879262 . -0.3009124<br>
-1.284759 </s><br>
-99 <s> -0.5989256<br>
-1.722562 A -0.4924272<br>
-3.040413 A. -0.4656199<br>
-4.578232 A.'S -0.2988251<br>
-4.879262 A.S -0.2973903<br>
-4.335194 ABANDON -0.3181008<br>
-4.335194 ABANDONED -0.4768775<br>
-4.402141 ABANDONING -0.535318<br>
-4.703171 ABBOUD -0.3001948<br>
-4.879262 ABBREVIATED -0.3008665<br>
-4.879262 ABERRATION -0.2933786<br>
</p>
<p><b>using -vocab</b></p>
<br>
\data\<br>
ngram 1=237764<br>
ngram 2=55267<br>
<br>
\1-grams:<br>
-6.536696 !EXCLAMATION-POINT<br>
-6.536696 "DOUBLE-QUOTE<br>
-6.536696 %PERCENT<br>
-6.536696 &AMPERSAND<br>
-6.536696 &EM<br>
-6.536696 &FLU<br>
-6.536696 &NEATH<br>
-6.536696 &SBLOOD<br>
-6.536696 &SDEATH<br>
-6.536696 &TIS<br>
-6.536696 &TWAS<br>
-6.536696 &TWEEN<br>
-6.536696 &TWERE<br>
-6.536696 &TWIXT<br>
-6.536696 'AVE<br>
-6.536696 'CAUSE<br>
-6.536696 'COS<br>
-6.536696 'EM<br>
<br>
<br>
<div class="moz-cite-prefix">On 06/07/16 11:44, Andreas Stolcke
wrote:<br>
</div>
<blockquote
cite="mid:8376a7f2-6e77-e9e2-b074-3930a8ee7d65@icsi.berkeley.edu"
type="cite">On 7/6/2016 4:57 AM, Bey Youcef wrote: <br>
<blockquote type="cite"> <br>
Thank you very much for your answer. <br>
<br>
Do you mean that before training we should have a corpus (T)
and a vocabulary (VOC), and replace absent words with UNK in the
training corpus? (I thought VOC was built from the 1-grams of T.) <br>
</blockquote>
Yes <br>
<blockquote type="cite"> <br>
In this case, what about unseen words that don't belong to VOC
during evaluation? Should we replace them with UNK and use the
probability already computed in the model? <br>
</blockquote>
Yes <br>
<br>
Both of these substitutions happen automatically in SRILM when
you specify the vocabulary with -vocab and also use the -unk
option. <br>
Other tools may do it differently. Note: SRILM uses
<unk> instead of <UNK>. <br>
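A minimal sketch of that setup (hypothetical file names; -vocab,
-unk, -lm, and -ppl are standard SRILM options): <br>
<pre wrap=""># training: words in train.txt but not in wordlist.txt are mapped to <unk>
ngram-count -order 2 -text train.txt -vocab wordlist.txt -unk -lm open.lm

# evaluation: out-of-vocabulary test words are likewise scored as <unk>
ngram -lm open.lm -unk -ppl test.txt</pre>
<br>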
<br>
<blockquote type="cite"> <br>
What then is smoothing for? <br>
</blockquote>
Smoothing is primarily for allowing unseen ngrams (not just
unigrams). For example, even though "mondays" occurred in the
training data, you might not have seen the ngram "i like
mondays". Smoothing removes some probability mass from the
observed ngrams "i like ..." and gives it to unseen ngrams that
start with "i like". <br>
<br>
Andreas <br>
<br>
<br>
</blockquote>
<br>
<br>
</blockquote>
</body>
</html>