Question Concerning ARPA-Format
Andreas Stolcke
stolcke at speech.sri.com
Tue Jul 28 16:21:00 PDT 2009
In message <454548.84797.qm at web63405.mail.re1.yahoo.com> you wrote:
>
> Dear Andreas Stolcke,
> I have a question concerning your toolkit/arpa-format. I know, that this question
> could probably be answered by doing research - but after exhaustive research I found
> no real answer...
>
> I want to include a list of, say, syntactically equal words in an ARPA-slm, if
> possible as an external file. With this, my input-sentences would look like this, e.g.:
>
> "Please give me the OBJECT"
> "Can I have the OBJECT"
>
> OBJECT: spoon, book, remote-control ... (these in an external file)
>
>
> Can you have such an external reference with ARPA and your toolkit - or do you have
> to copy the sentences, like this:
>
> "Please give me the spoon"
> "Please give me the book"
> "Please give me the remote-control"
>
> "Can I have the spoon"
> "Can I have the book"
> "Can I have the remote-control"
>
>
> It would be great, if you could give me a brief answer.

What you are describing is known as a "class-based" ngram LM.
It is supported by SRILM.

The steps are roughly:

1. Define the classes and their membership.
   The format is defined in the classes-format(5) man page.
   You can create the class definitions file by hand, or induce word classes
   from a corpus based on bigram cooccurrence statistics, using the
   ngram-class(1) program.
2. Preprocess your training corpus to replace words with classes.
   See the replace-words-with-classes script described in the
   training-scripts(1) man page.
3. Train a standard ngram LM on the processed data, using ngram-count(1).
4. Test the class-based LM using ngram(1) or another tool, supplying both the
   LM file and the class definitions file (from step 1) via the -classes option.
   See the ngram(1) man page.
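
As a rough illustration of the steps above, here is a minimal Python sketch built on the OBJECT example from the question. It is not part of SRILM and makes several assumptions: the CLASSES table, the helper names, and all file names are hypothetical; the class file lines use what I take to be the simplest form in classes-format(5), one "CLASS word" expansion per line with the optional probability left out; and replace_words() is only a plain-Python stand-in for the replace-words-with-classes script, handling single-word class members and no punctuation. The commands for steps 3 and 4 are only assembled, not executed, and use the -order, -text, -lm, -ppl and -classes options; please check them against the ngram-count(1) and ngram(1) man pages.

```python
# Sketch of the class-based LM workflow; names and file paths are hypothetical.

# Step 1: class membership, taken from the example in the question.
CLASSES = {"OBJECT": ["spoon", "book", "remote-control"]}


def class_file_lines(classes):
    """Return class definition lines, one 'CLASS word' expansion per line.

    Assumes the simplest classes-format(5) form (no probabilities).
    """
    return [f"{name} {word}" for name, words in classes.items() for word in words]


def replace_words(sentence, classes):
    """Step 2 stand-in: replace each class member with its class label.

    Only whitespace-separated, single-word members are handled.
    """
    word_to_class = {w: name for name, words in classes.items() for w in words}
    return " ".join(word_to_class.get(tok, tok) for tok in sentence.split())


def srilm_commands(train_text, lm_file, class_file, test_text, order=3):
    """Steps 3 and 4: build (but do not run) the SRILM command lines."""
    train = ["ngram-count", "-order", str(order),
             "-text", train_text, "-lm", lm_file]
    test = ["ngram", "-order", str(order), "-lm", lm_file,
            "-classes", class_file, "-ppl", test_text]
    return train, test


if __name__ == "__main__":
    for line in class_file_lines(CLASSES):
        print(line)
    print(replace_words("Please give me the spoon", CLASSES))
    for cmd in srilm_commands("train.classes.txt", "class.lm",
                              "object.classes", "test.txt"):
        print(" ".join(cmd))
```

With a class file like this, the training text only needs "Please give me the OBJECT" and "Can I have the OBJECT" once each, rather than one copy per object; the word-level test sentences are left as they are, since the class definitions are supplied at test time.
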
Andreas