<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Hi Hanno,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">The multiword mechanism in SRILM is
meant for LM applications where you want to chunk certain ngrams
(typically involving short and high-frequency words) into single
tokens, e.g.,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"> going to -> going_to</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"> i'm going to -> i'm_going_to</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">(some multiwords may subsume others).<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
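<div class="moz-cite-prefix">In practice this chunking is just a
  text rewrite applied before counting ngrams. A minimal sketch
  with sed (file names hypothetical; the longer multiword is
  substituted first so that it subsumes the shorter one):</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">  sed -e "s/i'm going to/i'm_going_to/g" \</div>
<div class="moz-cite-prefix">      -e "s/going to/going_to/g" \</div>
<div class="moz-cite-prefix">      corpus.txt > corpus.multiword.txt</div>
<div class="moz-cite-prefix"><br>
</div>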
<div class="moz-cite-prefix">This has two advantages: you extend
the effective span of an LM that is limited to a fixed length
(like bigram or trigram), and you can more easily model the
multiword tokens in associated models. For example, in a speech
recognizer you would now have a word "going_to" that you can give
a pronunciation that sounds like "gonna", or "i'm_going_to"
pronounced as "i'm a". Thus you're capturing coarticulation
effects. Some of the same effects would today be modeled by
finite state transducer composition, but as a first approximation
this hack was quite effective.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">You can certainly use this mechanism to
model word compounding, but I would question whether that's the best
approach. The multiword uses in English are for ngrams of short
words that should in some sense be treated as single words (even
though they are not spelled that way) because they are so
frequent. In German compounding you have the opposite problem:
there are a lot of compound words that are rare because they
involve too many, or rare, component words, so you cannot get
good statistics for the entire compound as a single word. In
that case you want to DEcompound the word for LM purposes,
e.g., represent "Verkehrschaos" as "Verkehrs_" + "Chaos" so
that you can capture the statistics of new compounds via
backoff and smoothing (that's a bad example because this is a
frequent term, but how about "Verkehrsschlamassel").<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
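<div class="moz-cite-prefix">So, hypothetically, a decompounded
  representation of that word in the training text would be</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">  Verkehrsschlamassel -> Verkehrs_ Schlamassel</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">so that the rare compound inherits the
  statistics of "Schlamassel" via backoff and smoothing.</div>
<div class="moz-cite-prefix"><br>
</div>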
<div class="moz-cite-prefix">Now to the function of the specific
SRILM tools and options you ask about:</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">ngram -multiwords simply splits the
incoming multiwords on the fly prior to LM evaluation, so that the
input text can contain multiwords while the LM does not. A
typical application is feeding the output of a speech recognizer
that uses multiwords to a longer-span LM that does NOT use them
(because multiwords are less effective if the ngram order is 4,
5, or higher). This just saves you the trouble of replacing the _ in
the input with spaces. The same functionality is available in
other tools, e.g., in lattice-tool (for lattice rescoring).</div>
<div class="moz-cite-prefix"><br>
</div>
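<div class="moz-cite-prefix">For example (file names hypothetical;
  the LM is a plain 5-gram without multiwords, while the input
  text contains tokens like "going_to"):</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">  ngram -lm 5gram.lm -ppl recognizer-output.txt -multiwords</div>
<div class="moz-cite-prefix"><br>
</div>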
<div class="moz-cite-prefix">To train a multiword LM you usually
preprocess the training data by mapping "going to" ->
"going_to" etc., then use the ngram-count tool as usual to build
the LM. However, that does not work when you don't have access to
the training data. multi-ngram takes an existing LM WITH
multiwords (the argument to -multi-lm, typically a lower-order
one, like a bigram or trigram) and recomputes the probabilities
according to a NON-multiword LM, typically a longer-span (4-, 5-, or 6-)gram
(the argument to -lm). So you don't need to reprocess the
training data. It uses the chain rule, for example, <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"> log p( going_to | i'm )
= log p(going| i'm) + log p(to | i'm going) <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">where the right-hand side is computed
based on the LM without multiwords.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
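<div class="moz-cite-prefix">A hypothetical invocation (file names
  made up) that takes an existing multiword trigram and recomputes
  its probabilities from a plain-word 5-gram would look something
  like</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">  multi-ngram -lm 5gram.lm -multi-lm trigram.multiword.lm \</div>
<div class="moz-cite-prefix">      -write-lm trigram.multiword.recomputed.lm</div>
<div class="moz-cite-prefix"><br>
</div>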
<div class="moz-cite-prefix">Andreas</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 1/30/2020 2:41 PM, Müller, H.M.
(Hanno) wrote:<br>
</div>
<blockquote type="cite"
cite="mid:2fb923e6-b4c5-26c0-cf56-e708c45c250e@let.ru.nl">
<p>Hi,</p>
<p>I'm trying to understand the concepts behind the option <i>-multiwords</i>
  of the <i>ngram</i> tool and the <i>multi-ngram</i> tool.
  Let's assume I derive an LM from a big corpus in which compounds
  are also included. In German, compounds sometimes make use of
  non-words. For instance, the word "Verkehrschaos" consists of
  the words "Verkehr" (traffic), "Chaos" (chaos), and the
  so-called Fugen-S "s". If I split all compounds into their
  components, the vocabulary would grow (because of items like
  the "s"), which would probably increase the perplexity when
  evaluated on a test corpus (where the compounds are also split).
  It probably won't make sense to split words like
  "Verkehrschaos" (into "verkehr s chaos"), which is a
  lexicalized word in German. However, in German new compounds
  can be created very easily, although they'll stay infrequent.
  Newly created compounds are often written with a hyphen, e.g.
  "Sonnen-Auto" (sun car, which could be a car driven by solar
  energy or a trendy word for a cabriolet). It would probably
  make sense to treat this construction as the sequence "sonnen
  auto".</p>
<p>I am wondering whether this is a case where <i>ngram
    -multiwords</i> or <i>multi-ngram</i> could be used, and what
  effect that would have on the creation of the LM and on the
  computation of the perplexity of an unseen text based on that
  LM. I guess my very generic question is: what's the <b>conceptual</b>
  difference between the following commands? (I can just run the
  two commands, but I'm really interested in the <b>underlying
    concepts</b>.)<br>
</p>
<p><i>ngram -lm input.lm -ppl text-with-hyphen-separated-compounds
    -multiwords -multi-char -</i></p>
<p><i>ngram -lm input.lm -ppl text-with-hyphen-separated-compounds</i></p>
</blockquote>
The only difference is that the second version uses the default
separator character (underscore), so if your input uses hyphens
instead this will not work as intended.<br>
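For example, with <i>-multi-char -</i> the input token "Sonnen-Auto"
is split into "Sonnen Auto" before evaluation, whereas without it
only tokens joined by underscores (like "Sonnen_Auto") would be
split.<br>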
<blockquote type="cite"
cite="mid:2fb923e6-b4c5-26c0-cf56-e708c45c250e@let.ru.nl">
<p>Can the option only be used while computing the perplexity of
some text on the basis of a LM? Or can it also be used while
deriving an LM from a corpus?</p>
</blockquote>
<p>It is used to compute the perplexity according to an LM without
  multiwords when the text DOES contain multiwords (see above). <br>
</p>
<p>For training the LM you would split the multiwords into their
components.<br>
</p>
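<p>A minimal sketch of that preprocessing (file names hypothetical;
  this naively splits at every hyphen, including non-compound
  uses):<br>
</p>
<p><i>sed 's/-/ /g' corpus.txt > corpus.split.txt</i><br>
  <i>ngram-count -text corpus.split.txt -order 3 -kndiscount
    -interpolate -lm split.lm</i><br>
</p>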
<blockquote type="cite"
cite="mid:2fb923e6-b4c5-26c0-cf56-e708c45c250e@let.ru.nl">
<p>Furthermore, there is the <i>multi-ngram</i> tool, which
  is intended to create an LM consisting of multiwords (e.g.
  compounds), right? This LM can then be inserted into a reference
  LM. But what's meant by a reference LM? Can somebody illustrate
  this with a small example?</p>
</blockquote>
<p>See above!</p>
<p>Andreas<br>
</p>
<br>
</body>
</html>