</s> Backoff missing

Wed Aug 21 13:48:17 PDT 2002

In message <B0793DB946E52942A49C1E8152A1358CE68B9A at leo.wins.fb.sony.de> you wrote:
> Hi all,
> 
> I have a problem using the toolkit, I create a language model using only the
> ngram-count command:
> 
> ngram-count -text my.text -lm my.arpa -wbdiscount1 -wbdiscount3 -wbdiscount3
> 
> 
> My text file has the sentence markers <s> </s>.
> 
> And then the arpa file I get, for the unigram </s> has no backoff weight and
> also all the bigrams that contain </s> as the second word in the bigram have
> no backoff either.
> Does someone know how to get the backoff weight? My problem is that the
> recognizer complains about the format of my language model, since all the
> bigrams without the backoff are not considered and then at the end since
> there are so many it stops.

We get this question a lot.  Technically speaking, backoff weights are 
only required for N-grams that are prefixes of longer N-grams (by the
definition of backoff weights).  Practically speaking, a lot of software
out there assumes that backoff weights are assigned to all N-grams
except those of the highest order.  That assumption is very wasteful
once you are dealing with pruned (or so-called "variable length")
N-gram models.  The add-dummy-bows script will add the backoff
weights that your software is missing.
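
To make the idea concrete, here is a minimal Python sketch of what
adding dummy backoff weights amounts to.  It is not the toolkit's
add-dummy-bows script, and the function name add_dummy_bows_sketch is
made up for illustration.  It assumes a plain ARPA file with "ngram N="
header lines and "\N-grams:" section markers, and it appends a backoff
weight of 0 (log10 of 1) to every N-gram below the highest order that
lacks one.

```python
import re

def add_dummy_bows_sketch(lines):
    """Return ARPA lines with a backoff weight of 0 added where missing.

    Hypothetical helper for illustration; not part of the toolkit.
    """
    lines = [line.rstrip("\n") for line in lines]
    # The highest order is read from the "ngram N=count" header lines.
    orders = [int(m.group(1)) for line in lines
              if (m := re.match(r"ngram (\d+)=", line))]
    max_order = max(orders, default=0)

    out = []
    order = 0  # 0 means "not inside an N-gram section"
    for line in lines:
        section = re.match(r"\\(\d+)-grams:", line)
        if section:
            order = int(section.group(1))
        elif line.startswith("\\"):
            order = 0  # \data\ or \end\
        elif 0 < order < max_order:
            fields = line.split()
            # Probability plus N words, and no backoff weight yet.
            if len(fields) == order + 1:
                line = line + "\t0"
        out.append(line)
    return out
```

N-grams of the highest order are left alone, since they never carry a
backoff weight.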

> 
> I also have another question about the format of the arpa file created.
> Between the probabilities and the words there is not a single space and this
> causes problems also with the recognizer I am using. What I am doing right
> now to avoid this problem is to use a perl script to fix the format and then
> use the converted file that has only a single space, is there an option to
> get a single space??

The toolkit outputs a tab after the probabilities and before the backoff
weights, so as to make things line up visually and make the file more readable.
This also makes it convenient to search for ngrams, or prefixes or suffixes
of ngrams, in the file (by including \t in your search pattern).
Again, if your software is too naive about the format then you need 
to bridge the gap, just as you have been doing.  Since all the tools
can read from stdin and write to stdout, you can do this on the fly
with a command like

	ngram-count ... -lm - | my-script-to-replace-tabs-with-spaces | \
	gzip > my-fixed-lm.gz 
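
The filter in the middle of that pipeline can be very small.  As a
rough sketch in Python (the function name tabs_to_spaces is only a
placeholder, and I am assuming your recognizer is happy once every tab
becomes a single space):

```python
import io
import sys

def tabs_to_spaces(src, dst):
    """Copy text from src to dst, turning every tab into a single space."""
    for line in src:
        dst.write(line.replace("\t", " "))

# Small in-memory demonstration on one made-up bigram line.
demo = io.StringIO()
tabs_to_spaces(io.StringIO("-1.2\tfoo bar\t-0.3\n"), demo)
print(repr(demo.getvalue()))

# As a pipeline filter, the call would instead be:
#     tabs_to_spaces(sys.stdin, sys.stdout)
```

Header lines and section markers contain no tabs, so they pass through
unchanged.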

Hope this helps.

--Andreas