<html>

  <head>

    <meta http-equiv="content-type" content="text/html; charset=UTF-8">

  </head>

  <body>

    <p>Hi,</p>

    <p>I'm trying to understand the concepts behind the option <i>-multiwords</i>

      of the tool <i>ngram</i> and behind the tool <i>multi-ngram</i>.

      Let's assume I derive a LM from a big corpus in which compounds

      are also included. In German, compounds sometimes make use of

      non-words. For instance, the word "Verkehrschaos" consists of

      the words "Verkehr" (traffic), "Chaos" (chaos), and the so-called

      Fugen-S "s". If I split all compounds into their components, the

      vocabulary would become larger (because of items like the "s"),

      which could probably increase the perplexity when evaluated on a

      test corpus (where the compounds are also split). It probably

      won't make sense to split words like "Verkehrschaos" (into

      "verkehr s chaos"), which is a lexicalized word in German.

      However, in German new compounds can be created very easily,

      although they'll stay infrequent. Newly created compounds are

      often written with a hyphen, e.g. "Sonnen-Auto" (sun car - which

      could be a car driven by sun energy or a trendy word for a

      cabriolet). It would probably make sense to treat this

      construction as the sequence "sonnen auto".</p>
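
    <p>To make the preprocessing I have in mind concrete, here is a minimal
      sketch in Python. It is not part of SRILM; the function name
      <i>split_hyphenated</i> is just something I made up for illustration,
      and I am assuming whitespace-tokenized, lowercased text. Lexicalized
      compounds such as "Verkehrschaos" are left untouched, only hyphenated
      ones are split.</p>

```python
def split_hyphenated(line: str) -> str:
    """Lowercase a line and split hyphen-joined compounds into separate tokens.

    Hypothetical helper for illustration only; not an SRILM function.
    """
    tokens = []
    for token in line.lower().split():
        # "sonnen-auto" -> ["sonnen", "auto"]; empty parts from stray hyphens are dropped
        tokens.extend(part for part in token.split("-") if part)
    return " ".join(tokens)


if __name__ == "__main__":
    print(split_hyphenated("Das Sonnen-Auto steht im Verkehrschaos"))
```

    <p>The same normalization would have to be applied to both the training
      corpus and the test corpus so that the two stay comparable.</p>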

    <p>I am wondering whether this is a case where <i>ngram

        -multiwords</i> or <i>multi-ngram</i> could be used, and what

      effect that would have on the creation of the LM and on the

      computation of the perplexity of an unseen text based on that LM.

      I guess my very generic question is then: what's the <b>conceptual</b>

      difference between the following commands? (I can

      just run the two commands, but I'm really interested in the <b>underlying

        concepts</b>.)</p>

    <p><i>ngram -lm input.lm -ppl text-with-hyphen-seperated-compounds

        -multiwords -multi-char -</i></p>

    <p><i>ngram -lm input.lm -ppl text-with-hyphen-seperated-compounds</i></p>

    <p>Can the option only be used when computing the perplexity of

      some text on the basis of an existing LM? Or can it also be used when

      deriving a LM from a corpus?</p>

    <p>Furthermore, there is the tool <i>multi-ngram</i>, which is

      intended to create a LM consisting of multiwords (e.g. compounds),

      right? This LM can then be inserted into a reference LM. But what is

      meant by a reference LM? Can somebody illustrate this with a small

      example?</p>

    <p>I'm looking forward to your answers :)</p>

    <p>Cheers,</p>

    <p>Hanno<br>

    </p>


  </body>

</html>