<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
</head>
<body>
<div class="moz-cite-prefix">Hi Hanno,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">The multiword mechanism in SRILM is
meant for LM applications where you want to chunk certain ngrams
(typically involving short and high-frequency words) into single
tokens, e.g.,</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"> going to -> going_to</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"> i'm going to -> i'm_going_to</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">(some multiwords may subsume others).<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
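<div class="moz-cite-prefix">In practice this chunking is just a
  text rewrite applied before counting ngrams. A minimal sketch
  with sed (file names hypothetical; the longer multiword is
  substituted first so that it subsumes the shorter one):</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">  sed -e "s/i'm going to/i'm_going_to/g" \</div>
<div class="moz-cite-prefix">      -e "s/going to/going_to/g" \</div>
<div class="moz-cite-prefix">      corpus.txt > corpus.multiword.txt</div>
<div class="moz-cite-prefix"><br>
</div>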
<div class="moz-cite-prefix">This has two advantages: you extend
the effective span of an LM that is limited to a fixed length
(like bigram or trigram), and you can more easily model the
multiword tokens in associated models. For example, in a speech
recognizer you would now have a word "going_to" that you can give
a pronunciation that sounds like "gonna", or "i'm_going_to"
pronounced as "i'm a". Thus you're capturing coarticulation
effects. Some of the same effects would today be modeled by
finite state transducer composition, but as a first approximation
this hack was quite effective.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">You can certainly use this mechanism to
model word compounding, but I would question whether that's the best
approach. The multiword uses in English are for ngrams of short
words that should in some sense be treated as single words (even
though they are not spelled that way) because they are so
frequent. In German compounding you have the opposite problem:
there are a lot of compound words that are rare because they
involve too many, or rare, component words, so you cannot get
good statistics for the entire compound as a single word. In
that case you want to DEcompound the word for LM purposes,
e.g., represent "Verkehrschaos" as "Verkehrs_" + "Chaos" so
that you can capture the statistics of new compounds via
backoff and smoothing (that's a bad example because this is a
frequent term, but how about "Verkehrsschlamassel").<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
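<div class="moz-cite-prefix">So, hypothetically, a decompounded
  representation of that word in the training text would be</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">  Verkehrsschlamassel -> Verkehrs_ Schlamassel</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">so that the rare compound inherits the
  statistics of "Schlamassel" via backoff and smoothing.</div>
<div class="moz-cite-prefix"><br>
</div>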
<div class="moz-cite-prefix">Now to the function of the specific
SRILM tools and options you ask about:</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">ngram -multiwords simply splits the
incoming multiwords on the fly prior to LM evaluation, so that the
input text can contain multiwords while the LM does not. A
typical application is feeding the output of a speech recognizer
that uses multiwords to a longer-span LM that does NOT use them
(because multiwords are less effective if the ngram order is 4,
5, or higher). This just saves you the trouble of replacing the _ in
the input with spaces. The same functionality is available in
other tools, e.g., in lattice-tool (for lattice rescoring).</div>
<div class="moz-cite-prefix"><br>
</div>
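<div class="moz-cite-prefix">For example (file names hypothetical;
  the LM is a plain 5-gram without multiwords, while the input
  text contains tokens like "going_to"):</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">  ngram -lm 5gram.lm -ppl recognizer-output.txt -multiwords</div>
<div class="moz-cite-prefix"><br>
</div>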
<div class="moz-cite-prefix">To train a multiword LM you usually
preprocess the training data by mapping "going to" ->
"going_to" etc., then use the ngram-count tool as usual to build
the LM. However, that does not work when you don't have access to
the training data. multi-ngram takes an existing LM WITH
multiwords (the argument to -multi-lm, typically a lower-order
one, like a bigram or trigram) and recomputes the probabilities
according to a NON-multiword LM, typically a longer-span (4-, 5-, or 6-)gram
(the argument to -lm). So you don't need to reprocess the
training data. It uses the chain rule, for example, <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"> log p( going_to | i'm )
= log p(going| i'm) + log p(to | i'm going) <br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">where the right-hand side is computed
based on the LM without multiwords.<br>
</div>
<div class="moz-cite-prefix"><br>
</div>
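<div class="moz-cite-prefix">A hypothetical invocation (file names
  made up) that takes an existing multiword trigram and recomputes
  its probabilities from a plain-word 5-gram would look something
  like</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">  multi-ngram -lm 5gram.lm -multi-lm trigram.multiword.lm \</div>
<div class="moz-cite-prefix">      -write-lm trigram.multiword.recomputed.lm</div>
<div class="moz-cite-prefix"><br>
</div>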
<div class="moz-cite-prefix">Andreas</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix"><br>
</div>
<div class="moz-cite-prefix">On 1/30/2020 2:41 PM, Müller, H.M.
(Hanno) wrote:<br>
</div>
<blockquote type="cite"
cite="mid:2fb923e6-b4c5-26c0-cf56-e708c45c250e@let.ru.nl">
<p>Hi,</p>
<p>I'm trying to understand the concepts behind the option <i>-multiwords</i>
  of the <i>ngram</i> tool and the <i>multi-ngram</i> tool.
  Let's assume I derive an LM from a big corpus in which compounds
  are also included. In German, compounds sometimes make use of
  non-words. For instance, the word "Verkehrschaos" consists of
  the words "Verkehr" (traffic), "Chaos" (chaos), and the
  so-called Fugen-S "s". If I split all compounds into their
  components, the vocabulary would grow (because of items like
  the "s"), which would probably increase the perplexity when
  evaluated on a test corpus (where the compounds are also split).
  It probably won't make sense to split words like
  "Verkehrschaos" (into "verkehr s chaos"), which is a
  lexicalized word in German. However, in German new compounds
  can be created very easily, although they'll stay infrequent.
  Newly created compounds are often written with a hyphen, e.g.
  "Sonnen-Auto" (sun car, which could be a car driven by solar
  energy or a trendy word for a cabriolet). It would probably
  make sense to treat this construction as the sequence "sonnen
  auto".</p>
<p>I am wondering whether this is a case where <i>ngram
    -multiwords</i> or <i>multi-ngram</i> could be used, and what
  effect that would have on the creation of the LM and on the
  computation of the perplexity of an unseen text based on that
  LM. I guess my very generic question is: what's the <b>conceptual</b>
  difference between the following commands? (I can just run the
  two commands, but I'm really interested in the <b>underlying
    concepts</b>.)<br>
</p>
<p><i>ngram -lm input.lm -ppl text-with-hyphen-separated-compounds
    -multiwords -multi-char -</i></p>
<p><i>ngram -lm input.lm -ppl text-with-hyphen-separated-compounds</i></p>
</blockquote>
The only difference is that the second version uses the default
separator character (underscore), so if your input uses hyphens
instead this will not work as intended.<br>
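For example, with <i>-multi-char -</i> the input token "Sonnen-Auto"
is split into "Sonnen Auto" before evaluation, whereas without it
only tokens joined by underscores (like "Sonnen_Auto") would be
split.<br>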
<blockquote type="cite"
cite="mid:2fb923e6-b4c5-26c0-cf56-e708c45c250e@let.ru.nl">
<p>Can the option only be used while computing the perplexity of
some text on the basis of a LM? Or can it also be used while
deriving an LM from a corpus?</p>
</blockquote>
<p>It is used to compute the perplexity according to an LM without
  multiwords when the text DOES contain multiwords (see above). <br>
</p>
<p>For training the LM you would split the multiwords into their
components.<br>
</p>
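<p>A minimal sketch of that preprocessing (file names hypothetical;
  this naively splits at every hyphen, including non-compound
  uses):<br>
</p>
<p><i>sed 's/-/ /g' corpus.txt > corpus.split.txt</i><br>
  <i>ngram-count -text corpus.split.txt -order 3 -kndiscount
    -interpolate -lm split.lm</i><br>
</p>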
<blockquote type="cite"
cite="mid:2fb923e6-b4c5-26c0-cf56-e708c45c250e@let.ru.nl">
<p>Furthermore, there is the <i>multi-ngram</i> tool, which
  is intended to create an LM consisting of multiwords (e.g.
  compounds), right? This LM can then be inserted into a reference
  LM. But what's meant by a reference LM? Can somebody illustrate
  this with a small example?</p>
</blockquote>
<p>See above!</p>
<p>Andreas<br>
</p>
<br>
</body>
</html>