<html>
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
</head>
<body>
<p>Hi,</p>
<p>I'm trying to understand the concepts behind the option <i>-multiwords</i>
of the <i>ngram</i> tool and behind the <i>multi-ngram</i> tool.
Let's assume I derive an LM from a big corpus that also contains
compounds. In German, compounds sometimes make use of
non-words. For instance, the word "Verkehrschaos" consists of
the words "Verkehr" (traffic) and "Chaos" (chaos), joined by the so-called
Fugen-S "s". If I split all compounds into their components, the
vocabulary would become larger (because of items like the "s"),
which could probably increase the perplexity when evaluated on a
test corpus (where the compounds are also split). It probably
doesn't make sense to split a word like "Verkehrschaos" (into
"verkehr s chaos"), since it is a lexicalized word in German.
However, in German new compounds can be created very easily,
although they'll stay infrequent. Newly created compounds are
often written with a hyphen, e.g. "Sonnen-Auto" (sun car - which
could be a car driven by sun energy or a trendy word for a
cabriolet). It would probably make sense to treat this
construction as the sequence "sonnen auto".</p>
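<p>To make the preprocessing I have in mind concrete, here is a minimal
sketch in Python. The helper name <i>split_hyphenated</i> is hypothetical
and not part of any toolkit; I am assuming whitespace-tokenized text and
that everything is lowercased, as in the examples above.</p>
<pre>
```python
def split_hyphenated(line):
    """Split hyphen-joined compounds into lowercase components.

    Hypothetical helper for illustration only; lexicalized compounds
    without a hyphen (e.g. "Verkehrschaos") are left in one piece.
    """
    out = []
    for token in line.split():
        # Drop empty pieces caused by leading or trailing hyphens.
        parts = [p for p in token.split("-") if p]
        # A bare "-" (used as a dash) has no parts; keep it unchanged.
        out.extend(parts if parts else [token])
    return " ".join(out).lower()


print(split_hyphenated("Das Sonnen-Auto steht im Verkehrschaos"))
```
</pre>
<p>Both the training corpus and the test text would have to go through
the same step, so that the LM and the evaluated text use the same
tokenization.</p>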
<p>I am wondering whether this is a case where <i>ngram
-multiwords</i> or <i>multi-ngram</i> could be used, and what
effect that would have on the creation of the LM and on the
computation of the perplexity of an unseen text based on that LM.
I guess my very generic question is then: what's the <b>conceptual</b>
difference between the following commands? (I can
just run the two commands, but I'm really interested in the <b>underlying
concepts</b>.)</p>
<p><i>ngram -lm input.lm -ppl text-with-hyphen-seperated-compounds
-multiwords -multi-char -</i></p>
<p><i>ngram -lm input.lm -ppl text-with-hyphen-seperated-compounds</i></p>
<p>Can the option only be used when computing the perplexity of
some text on the basis of an LM? Or can it also be used when
deriving an LM from a corpus?</p>
<p>Furthermore, there is the <i>multi-ngram</i> tool, which is
intended to create an LM consisting of multiwords (e.g. compounds),
right? This LM can then be inserted into a reference LM. But what's
meant by a reference LM? Can somebody illustrate this with a small
example?</p>
<p>I'm looking forward to your answers :)</p>
<p>Cheers,</p>
<p>Hanno<br>
</p>
</body>
</html>