<html>
<head>
<meta content="text/html; charset=utf-8" http-equiv="Content-Type">
</head>
<body bgcolor="#FFFFFF" text="#000000">
<p>Hi all,</p>
<p>Something happens when I add the -vocab option, and I wonder whether
this is the correct behavior and whether both LMs are correct.</p>
<p>With -vocab, all the probabilities are nearly equal, while without
-vocab they vary much more, and for the 1-grams there is an extra
probability column...</p>
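<p>For context, the two runs were along these lines (a sketch with
placeholder file names, not my exact command lines):</p>
<pre>
# bigram LM, vocabulary taken from the training text itself
ngram-count -order 2 -text train.txt -lm without_vocab.lm

# same training text, but with a fixed word list supplied via -vocab
ngram-count -order 2 -text train.txt -vocab wordlist.txt -lm with_vocab.lm
</pre>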
<p>Please take a look below and comment.<br>
</p>
<p>best regards</p>
<p>ana</p>
<p><br>
</p>
<p><b>without -vocab</b></p>
<p>\data\<br>
ngram 1=10819<br>
ngram 2=58565<br>
<br>
\1-grams:<br>
-4.879262 . -0.3009124<br>
-1.284759 &lt;/s&gt;<br>
-99 &lt;s&gt; -0.5989256<br>
-1.722562 A -0.4924272<br>
-3.040413 A. -0.4656199<br>
-4.578232 A.'S -0.2988251<br>
-4.879262 A.S -0.2973903<br>
-4.335194 ABANDON -0.3181008<br>
-4.335194 ABANDONED -0.4768775<br>
-4.402141 ABANDONING -0.535318<br>
-4.703171 ABBOUD -0.3001948<br>
-4.879262 ABBREVIATED -0.3008665<br>
-4.879262 ABERRATION -0.2933786<br>
</p>
<p><b>with -vocab</b></p>
<br>
\data\<br>
ngram 1=237764<br>
ngram 2=55267<br>
<br>
\1-grams:<br>
-6.536696 !EXCLAMATION-POINT<br>
-6.536696 "DOUBLE-QUOTE<br>
-6.536696 %PERCENT<br>
-6.536696 &amp;AMPERSAND<br>
-6.536696 &EM<br>
-6.536696 &FLU<br>
-6.536696 &NEATH<br>
-6.536696 &SBLOOD<br>
-6.536696 &SDEATH<br>
-6.536696 &TIS<br>
-6.536696 &TWAS<br>
-6.536696 &TWEEN<br>
-6.536696 &TWERE<br>
-6.536696 &TWIXT<br>
-6.536696 'AVE<br>
-6.536696 'CAUSE<br>
-6.536696 'COS<br>
-6.536696 'EM<br>
<br>
<br>
<div class="moz-cite-prefix">On 06/07/16 11:44, Andreas Stolcke
wrote:<br>
</div>
<blockquote
cite="mid:8376a7f2-6e77-e9e2-b074-3930a8ee7d65@icsi.berkeley.edu"
type="cite">On 7/6/2016 4:57 AM, Bey Youcef wrote:
<br>
<blockquote type="cite">
<br>
Thank you very much for your answer.
<br>
<br>
Do you mean that before training we should have a corpus (T) and a
vocabulary (VOC), and replace words absent from VOC with UNK in the
training corpus? (I thought VOC was built from the 1-grams of T.)
<br>
</blockquote>
Yes
<br>
<blockquote type="cite">
<br>
In that case, what about unseen words that don't belong to VOC
during evaluation? Should we replace them with UNK and use the
probability already computed in the model?
<br>
</blockquote>
Yes
<br>
<br>
Both of these substitutions happen automatically in SRILM when you
specify the vocabulary with -vocab and also use the -unk option.
<br>
Other tools may do it differently. Note: SRILM uses &lt;unk&gt;
instead of &lt;UNK&gt;. [A command sketch follows this quoted message.]
<br>
<br>
<blockquote type="cite">
<br>
What then is smoothing for?
<br>
</blockquote>
Smoothing is primarily for allowing unseen ngrams (not just
unigrams). For example, even though "mondays" occurred in the
training data, you might never have seen the ngram "i like mondays".
Smoothing removes some probability from all the observed ngrams "i
like ..." and gives it to unseen ngrams that start with "i like".
[A worked backoff example follows this quoted message.]
<br>
<br>
Andreas
<br>
<br>
<br>
_______________________________________________
<br>
SRILM-User site list
<br>
<a class="moz-txt-link-abbreviated" href="mailto:SRILM-User@speech.sri.com">SRILM-User@speech.sri.com</a>
<br>
<a class="moz-txt-link-freetext" href="http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user">http://mailman.speech.sri.com/cgi-bin/mailman/listinfo/srilm-user</a><br>
</blockquote>
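<p>To make the quoted advice concrete, here is a minimal sketch of
training with a fixed vocabulary plus -unk and then scoring held-out
text. The file names are placeholders; the flags are standard
ngram-count/ngram options.</p>
<pre>
# train a bigram LM; words in train.txt that are missing from
# wordlist.txt are mapped to &lt;unk&gt;, which gets its own probability
ngram-count -order 2 -text train.txt -vocab wordlist.txt -unk -lm unk.lm

# evaluate: OOV words in test.txt are likewise mapped to &lt;unk&gt;
ngram -order 2 -lm unk.lm -unk -ppl test.txt
</pre>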
<br>
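<p>To connect the smoothing answer back to the listings above: in an
ARPA-format LM, the extra third column on a 1-gram line is the log10
backoff weight, and unseen bigrams are scored by backing off through
it. A worked example with numbers from the first model, assuming the
bigram "A ABERRATION" never occurred in training:</p>
<pre>
1-gram entries from the first model (log10 prob, word, log10 backoff):
    -1.722562  A           -0.4924272
    -4.879262  ABERRATION  -0.2933786

If "A ABERRATION" is absent from the \2-grams: section, back off:
    log10 P(ABERRATION | A) = bow(A) + log10 P(ABERRATION)
                            = -0.4924272 + (-4.879262)
                            = -5.3716892
</pre>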
</body>
</html>