[SRILM User List] [EXTERNAL] What's the idea behind ngram -multiwords and multi-ngram
Andreas Stolcke
stolcke at icsi.berkeley.edu
Thu Jan 30 18:18:39 PST 2020
Hi Hanno,
The multiword mechanism in SRILM is meant for LM applications where you
want to chunk certain ngrams (typically involving short and
high-frequency words) into single tokens, e.g.,
going to -> going_to
i'm going to -> i'm_going_to
(some multiwords may subsume others).
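A naive way to produce such a mapping is a stream edit over the text (just a
sketch; the file names are made up, and note that the longer multiword has to
be substituted first so that it takes precedence over the shorter one):

    # join selected high-frequency word sequences into single tokens;
    # longest multiwords first, since some subsume others
    sed "s/i'm going to/i'm_going_to/g; s/going to/going_to/g" \
        corpus.txt > corpus-mw.txt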
This has two advantages: you extend the effective span of an LM that is
limited to a fixed length (like a bigram or trigram), and you can more
easily model the multiword tokens in associated models. For example, in
a speech recognizer you would now have a word "going_to" that you can
give a pronunciation that sounds like "gonna", or "i'm_going_to"
pronounced as "i'm a". Thus you're capturing coarticulation effects.
Some of the same effects would today be modeled by finite state
transducer composition, but as a first approximation this hack was quite
effective.
You can certainly use this mechanism to model word compounding, but I
would question whether that's the best approach. The multiword uses in
English are for ngrams of short words that should in some sense be
treated as single words (even though they are not spelled that way)
because they are so frequent. In German compounding you have the
opposite problem: there are a lot of compound words that are rare because
they combine many or rare components, so you cannot get good
statistics for the entire compound as a single word. In that case you
want to DEcompound the word for LM purposes, e.g., represent
"Verkehrschaos" as "Verkehrs_" + "Chaos" so that you can capture the
statistics of new compounds via backoff and smoothing (that's a bad
example because this is a frequent term, but how about
"Verkehrsschlamassel").
Now to the function of the specific SRILM tools and options you ask about:
ngram -multiwords simply splits the incoming multiwords on the fly prior
to LM evaluation, so that the input text can contain multiwords while
the LM does not. A typical application is feeding the output of a
speech recognizer that uses multiwords to a longer-span LM that does NOT
use them (because multiwords are less effective when the ngram order is
4, 5, or higher). This just saves you the trouble of replacing the _ in the
input with spaces. The same functionality is available in other tools,
e.g., in lattice-tool (for lattice rescoring).
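For example (a sketch; the file names are made up):

    # recognizer-output.txt contains tokens like "i'm_going_to";
    # -multiwords splits them on the fly before the 4-gram LM
    # (trained without multiwords) scores the text
    ngram -order 4 -lm big4gram.lm -multiwords -ppl recognizer-output.txt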
To train a multiword LM you usually preprocess the training data by
mapping "going to" -> "going_to" etc., then use the ngram-count tool as
usual to build the LM. However, that does not work when you don't have
access to the training data. multi-ngram takes an existing LM WITH
multiwords (the argument to -multi-lm, typically a lower-order one, like
a bigram or trigram) and recomputes the probabilities according to a
NON-multiword LM, typically a longer (4-, 5-, or 6-)gram (the argument
to -lm). So you don't need to reprocess the training data. It uses the
chain rule, for example,

    log p(going_to | i'm) = log p(going | i'm) + log p(to | i'm going)
where the right-hand side is computed based on the LM without mws.
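In command form this looks something like the following (a sketch; the file
names are made up, and I'm assuming the usual -order and -write-lm options,
which multi-ngram shares with the other SRILM LM tools):

    # recompute the probabilities of the multiword ngrams in mw-trigram.lm
    # from the non-multiword 5-gram, by the chain rule as above
    multi-ngram -order 5 -lm big5gram.lm -multi-lm mw-trigram.lm \
        -write-lm mw-trigram-recomputed.lm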
Andreas
On 1/30/2020 2:41 PM, Müller, H.M. (Hanno) wrote:
>
> Hi,
>
> I'm trying to understand the concepts behind the -multiwords option
> of the ngram tool and behind the multi-ngram tool. Let's assume I
> derive an LM from a big corpus that also includes compounds. In
> German, compounds sometimes make use of non-words. For instance, the
> word "Verkehrschaos" consists of the words "Verkehr" (traffic),
> "Chaos" (chaos), and the so-called Fugen-S "s". If I split all
> compounds into their components, the vocabulary would become larger
> (because of items like the "s"), which would probably increase the
> perplexity when evaluated on a test corpus (where the compounds are
> also split). It probably won't make sense to split words like
> "Verkehrschaos" (into "verkehr s chaos"), which is a lexicalized word
> in German. However, in German new compounds can be created very
> easily, although they'll stay infrequent. Newly created compounds are
> often written with a hyphen, e.g. "Sonnen-Auto" (sun car, which could
> be a car powered by solar energy or a trendy word for a cabriolet). It
> would probably make sense to treat this construction as the sequence
> "sonnen auto".
>
> I am wondering whether this is a case where ngram -multiwords or
> multi-ngram could be used, and what effect that would have on the
> creation of the LM and on the computation of the perplexity of an
> unseen text based on that LM. I guess my very generic question is,
> then: what's the *conceptual* difference between the following
> commands? (I can just run the two commands, but I'm really
> interested in the *underlying concepts*.)
>
> ngram -lm input.lm -ppl text-with-hyphen-separated-compounds
> -multiwords -multi-char -
>
> ngram -lm input.lm -ppl text-with-hyphen-separated-compounds
>
The only difference is that the second version uses the default
separator character (underscore), so if your input uses hyphens instead,
this will not work as intended.
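Concretely (the same as your first command, just to spell out the effect):

    # split on "-" instead of the default "_":
    # "Sonnen-Auto" -> "Sonnen Auto" before LM evaluation
    ngram -lm input.lm -ppl text-with-hyphen-separated-compounds \
        -multiwords -multi-char -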
>
> Can the option only be used while computing the perplexity of some
> text on the basis of an LM? Or can it also be used while deriving an LM
> from a corpus?
>
It is used to compute the perplexity according to an LM without mws when
the text DOES contain mws (see above).
For training the LM you would split the mws into their components.
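For example (a sketch using your hyphenated compounds; the file names are
made up, and the sed rule is naive in that it also splits hyphens that are
not compound boundaries):

    # decompound the training text: "Sonnen-Auto" -> "Sonnen Auto"
    sed 's/-/ /g' train.txt > train-split.txt
    # then build the LM as usual from the decompounded text
    ngram-count -order 3 -text train-split.txt -lm split.lm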
> Furthermore, there is the multi-ngram tool, which is intended to
> create an LM consisting of multiwords (e.g. compounds), right? This LM
> can then be inserted into a reference LM. But what's meant by a
> reference LM? Can somebody illustrate this with a small example?
>
See above!
Andreas