Question Concerning ARPA-Format

Tue Jul 28 16:21:00 PDT 2009

In message <454548.84797.qm at web63405.mail.re1.yahoo.com> you wrote:
> 
> Dear Andreas Stolcke,
> I have a question concerning your toolkit/arpa-format. I know, that this question
> could probably be answered by doing research - but after exhaustive research I
> found no real answer...
> 
> I want to include a list of, say, syntactically equal words in an ARPA-slm, if
> possible as an external file. With this, my input-sentences would look like this,
> f.e.:
> 
> "Please give me the OBJECT"
> "Can I have the OBJECT"
> 
> OBJECT: spoon, book, remote-control ... (these in an external file)
> 
> 
> Can you have such an external reference with ARPA and your toolkit - or do you
> have to copy the sentences, like this:
> 
> "Please give me the spoon"
> "Please give me the book"
> "Please give me the remote-control"
> 
> "Can I have the spoon"
> "Can I have the book"
> "Can I have the remote-control"
> 
> 
> It would be great, if you could give me a brief answer.

What you are describing is known as a "class-based" ngram LM.
It is supported by SRILM.

The steps are roughly:

1. Define the classes and their membership.
   The format is defined in the classes-format(5) man page.
   You can create one by hand, or induce word classes from a corpus based on 
   bigram cooccurrence statistics, using the ngram-class(1) program.

2. Preprocess your training corpus to replace words with classes.  
   See the replace-words-with-classes script described in the training-scripts(1)
   man page.

3. Train a standard ngram LM on the processed data, using ngram-count(1).

4. Test the class-based LM using ngram or another tool, supplying both the LM file
   and the class definitions file (from step 1), via the -classes option. 
   See the ngram(1) man page.
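
To make steps 1 and 2 concrete for the OBJECT example, here is a rough Python
sketch. It is not part of SRILM and does not replace the
replace-words-with-classes script; it only illustrates the idea. It assumes
the one-expansion-per-line layout "CLASS PROB WORD" for the class definitions
(please check classes-format(5) for the authoritative description) and
uniform membership probabilities, which is an arbitrary choice. The function
names are made up for this illustration.

```python
def class_definition_lines(classes):
    """Return class definition lines, one per class member.

    Assumed layout: "CLASS PROB WORD", with uniform probabilities
    within each class (an illustrative choice, not a requirement).
    """
    lines = []
    for name, members in classes.items():
        prob = 1.0 / len(members)
        for word in members:
            lines.append(f"{name} {prob:.6f} {word}")
    return lines


def replace_words_with_classes(sentence, classes):
    """Replace every class member in a whitespace-tokenized sentence
    with the name of its class; other words pass through unchanged."""
    word_to_class = {
        word: name for name, members in classes.items() for word in members
    }
    return " ".join(word_to_class.get(tok, tok) for tok in sentence.split())


# Hypothetical class and corpus taken from the question above.
classes = {"OBJECT": ["spoon", "book", "remote-control"]}
corpus = [
    "Please give me the spoon",
    "Can I have the book",
]

for line in class_definition_lines(classes):
    print(line)
for sentence in corpus:
    print(replace_words_with_classes(sentence, classes))
```

With the class definitions and the processed corpus written to files, steps 3
and 4 would look roughly like "ngram-count -text train.txt -lm class.lm" and
"ngram -lm class.lm -classes objects.classes -ppl test.txt". The file names
here are placeholders.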

Andreas