[SRILM User List] [EXTERNAL] What's the idea behind ngram -multiwords and multi-ngram
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Jan 30 18:18:39 PST 2020
Hi Hanno,
The multiword mechanism in SRILM is meant for LM applications where you
want to chunk certain ngrams (typically involving short and
high-frequency words) into single tokens, e.g.,
going to -> going_to
i'm going to -> i'm_going_to
(some multiwords may subsume others).
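A naive way to produce such a mapping is a stream edit over the text (just a
sketch; the file names are made up, and note that the longer multiword has to
be substituted first so that it takes precedence over the shorter one):

    # join selected high-frequency word sequences into single tokens;
    # longest multiwords first, since some subsume others
    sed "s/i'm going to/i'm_going_to/g; s/going to/going_to/g" \
        corpus.txt > corpus-mw.txt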
This has two advantages: you extend the effective span of an LM that is
limited to a fixed length (like a bigram or trigram), and you can more
easily model the multiword tokens in associated models. For example, in
a speech recognizer you would now have a word "going_to" that you can
give a pronunciation that sounds like "gonna", or "i'm_going_to"
pronounced as "i'm a". Thus you're capturing coarticulation effects.
Some of the same effects would today be modeled by finite state
transducer composition, but as a first approximation this hack was quite
effective.
You can certainly use this mechanism to model word compounding, but I
would question whether that's the best approach. The multiword uses in
English are for ngrams of short words that should in some sense be
treated as single words (even though they are not spelled that way)
because they are so frequent. In German compounding you have the
opposite problem: there are a lot of compound words that are rare because
they combine many or rare components, so you cannot get good
statistics for the entire compound as a single word. In that case you
want to DEcompound the word for LM purposes, e.g., represent
"Verkehrschaos" as "Verkehrs_" + "Chaos" so that you can capture the
statistics of new compounds via backoff and smoothing (that's a bad
example because this is a frequent term, but how about
"Verkehrsschlamassel").
Now to the function of the specific SRILM tools and options you ask about:
ngram -multiwords simply splits the incoming multiwords on the fly prior
to LM evaluation, so that the input text can contain multiwords while
the LM does not. A typical application is feeding the output of a
speech recognizer that uses multiwords to a longer-span LM that does NOT
use them (because multiwords are less effective when the ngram order is
4, 5, or higher). This just saves you the trouble of replacing the _ in the
input with spaces. The same functionality is available in other tools,
e.g., in lattice-tool (for lattice rescoring).
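For example (a sketch; the file names are made up):

    # recognizer-output.txt contains tokens like "i'm_going_to";
    # -multiwords splits them on the fly before the 4-gram LM
    # (trained without multiwords) scores the text
    ngram -order 4 -lm big4gram.lm -multiwords -ppl recognizer-output.txt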
To train a multiword LM you usually preprocess the training data by
mapping "going to" -> "going_to" etc., then use the ngram-count tool as
usual to build the LM. However, that does not work when you don't have
access to the training data. multi-ngram takes an existing LM WITH
multiwords (the argument to -multi-lm, typically a lower-order one, like
a bigram or trigram) and recomputes the probabilities according to a
NON-multiword LM, typically a longer (4-, 5-, or 6-)gram (the argument
to -lm). So you don't need to reprocess the training data. It uses the
chain rule, for example,

    log p(going_to | i'm) = log p(going | i'm) + log p(to | i'm going)
where the right-hand side is computed based on the LM without mws.
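In command form this looks something like the following (a sketch; the file
names are made up, and I'm assuming the usual -order and -write-lm options,
which multi-ngram shares with the other SRILM LM tools):

    # recompute the probabilities of the multiword ngrams in mw-trigram.lm
    # from the non-multiword 5-gram, by the chain rule as above
    multi-ngram -order 5 -lm big5gram.lm -multi-lm mw-trigram.lm \
        -write-lm mw-trigram-recomputed.lm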
Andreas
On 1/30/2020 2:41 PM, Müller, H.M. (Hanno) wrote:
>
> Hi,
>
> I'm trying to understand the concepts behind the -multiwords option
> of the ngram tool and behind the multi-ngram tool. Let's assume I
> derive an LM from a big corpus that also includes compounds. In
> German, compounds sometimes make use of non-words. For instance, the
> word "Verkehrschaos" consists of the words "Verkehr" (traffic),
> "Chaos" (chaos), and the so-called Fugen-S "s". If I split all
> compounds into their components, the vocabulary would become larger
> (because of items like the "s"), which would probably increase the
> perplexity when evaluated on a test corpus (where the compounds are
> also split). It probably won't make sense to split words like
> "Verkehrschaos" (into "verkehr s chaos"), which is a lexicalized word
> in German. However, in German new compounds can be created very
> easily, although they'll stay infrequent. Newly created compounds are
> often written with a hyphen, e.g. "Sonnen-Auto" (sun car, which could
> be a car powered by solar energy or a trendy word for a cabriolet). It
> would probably make sense to treat this construction as the sequence
> "sonnen auto".
>
> I am wondering whether this is a case where ngram -multiwords or
> multi-ngram could be used, and what effect that would have on the
> creation of the LM and on the computation of the perplexity of an
> unseen text based on that LM. I guess my very generic question is,
> then: what's the *conceptual* difference between the following
> commands? (I can just run the two commands, but I'm really
> interested in the *underlying concepts*.)
>
> ngram -lm input.lm -ppl text-with-hyphen-separated-compounds
> -multiwords -multi-char -
>
> ngram -lm input.lm -ppl text-with-hyphen-separated-compounds
>
The only difference is that the second version uses the default
separator character (underscore), so if your input uses hyphens instead,
this will not work as intended.
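Concretely (the same as your first command, just to spell out the effect):

    # split on "-" instead of the default "_":
    # "Sonnen-Auto" -> "Sonnen Auto" before LM evaluation
    ngram -lm input.lm -ppl text-with-hyphen-separated-compounds \
        -multiwords -multi-char -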
>
> Can the option only be used while computing the perplexity of some
> text on the basis of an LM? Or can it also be used while deriving an LM
> from a corpus?
>
It is used to compute the perplexity according to an LM without mws when
the text DOES contain mws (see above).
For training the LM you would split the mws into their components.
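For example (a sketch using your hyphenated compounds; the file names are
made up, and the sed rule is naive in that it also splits hyphens that are
not compound boundaries):

    # decompound the training text: "Sonnen-Auto" -> "Sonnen Auto"
    sed 's/-/ /g' train.txt > train-split.txt
    # then build the LM as usual from the decompounded text
    ngram-count -order 3 -text train-split.txt -lm split.lm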
> Furthermore, there is the multi-ngram tool, which is intended to
> create an LM consisting of multiwords (e.g. compounds), right? This LM
> can then be inserted into a reference LM. But what's meant by a
> reference LM? Can somebody illustrate this with a small example?
>
See above!
Andreas