[SRILM User List] [EXTERNAL] What's the idea behind ngram -multiwords and multi-ngram

Thu Jan 30 14:41:33 PST 2020

Hi,

I'm trying to understand the concepts behind the /-multiwords/ option of
the /ngram/ tool and behind the /multi-ngram/ tool. Let's assume I
derive an LM from a big corpus that also contains compounds. In
German, compounds sometimes contain elements that are not words on
their own. For instance, the word "Verkehrschaos" consists of the words
"Verkehr" (traffic) and "Chaos" (chaos), joined by the so-called
Fugen-S "s". If I split all compounds into their components, the
vocabulary would become larger (because of items like the "s"), which
could probably increase the perplexity when evaluated on a test corpus
(where the compounds are also split). It probably doesn't make sense to
split words like "Verkehrschaos" (into "verkehr s chaos"), since it is
a lexicalized word in German. However, new compounds can be created
very easily in German, although they stay infrequent. Newly created
compounds are often written with a hyphen, e.g. "Sonnen-Auto" (sun car,
which could be a car driven by solar energy or a trendy word for a
convertible). It would probably make sense to treat this construction
as the sequence "sonnen auto".
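
To make concrete what I mean by treating "Sonnen-Auto" as "sonnen auto",
here is a minimal sketch of the preprocessing I have in mind. It is not
part of SRILM; the function name split_hyphenated is made up for
illustration, and I'm assuming the text is already tokenized on
whitespace and that the corpus is lowercased.

```python
import re

# Hypothetical helper, not an SRILM function: a token counts as a
# hyphenated compound if it is made of word characters joined by hyphens.
HYPHENATED = re.compile(r"^\w+(?:-\w+)+$")

def split_hyphenated(line):
    """Lowercase a whitespace-tokenized line and split hyphenated compounds."""
    tokens = []
    for token in line.split():
        if HYPHENATED.match(token):
            # "Sonnen-Auto" becomes the two tokens "sonnen" and "auto"
            tokens.extend(part.lower() for part in token.split("-"))
        else:
            # Lexicalized compounds such as "Verkehrschaos" stay one token
            tokens.append(token.lower())
    return " ".join(tokens)

print(split_hyphenated("Das Sonnen-Auto verursacht kein Verkehrschaos"))
```

With this, only the hyphenated compounds are broken up, while
"Verkehrschaos" is kept as a single vocabulary item.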

I am wondering whether this is a case where /ngram -multiwords/ or
/multi-ngram/ could be used, and what effect that would have on the
creation of the LM and on the computation of the perplexity of an
unseen text based on that LM. I guess my very generic question is then:
what's the *conceptual* difference between the following commands? (I
can just run both commands, but I'm really interested in the
*underlying concepts*.)

/ngram -lm input.lm -ppl text-with-hyphen-seperated-compounds
-multiwords -multi-char -/

/ngram -lm input.lm -ppl text-with-hyphen-seperated-compounds/

Can the /-multiwords/ option only be used when computing the perplexity
of some text on the basis of an LM? Or can it also be used when deriving
an LM from a corpus?

Furthermore, there is the /multi-ngram/ tool, which is intended to
create an LM consisting of multiwords (e.g. compounds), right? This LM
can then be inserted into a reference LM. But what is meant by
"reference LM"? Can somebody illustrate this with a small example?

I'm looking forward to your answers :)

Cheers,

Hanno

