SRILM

Sat Jun 9 11:47:21 PDT 2001

In message <3B219F9A.FD1D392E at cs.bilkent.edu.tr> you wrote:
> Hello Andreas,
> 
> My name is Umut Topkara, and I am an MS student in Bilkent University,
> in Turkey. I have been using SRILM for my MS thesis. I would like to
> thank you for providing the code publicly. I've really benefited from it
> a lot. I have made a few additions to the code to use it for deriving
> and applying different language models for prefixes and suffixes of
> Turkish words. I preferred wrapping my code around SRILM code rather
> than changing parts of it. At the time I started writing my code,
> multi-ngram was not available. As far as I can see from the source code,
> it could have been a good starting point to add code for a language model
> that exploits morphology.
> 
> I have a comment on the toolkit that I want to share with you. For my
> particular case I can say that, if the toolkit had supported a mapping
> from input words to words looked up in the language models through a
> user-defined function, it would have been invaluable. That way,
> morphological processing of the words could be done on the fly and
> easily integrated into language modeling. Although this might be of
> limited benefit for English, it would have a good impact on modeling
> languages with more productive and rich morphology.
> 
> Thank you very much again for the toolkit.

Umut,

I'm glad the toolkit was useful to you, and thanks much for your input.   

If you just want a one-to-one mapping of "surface" words to an "internal"
vocabulary you can do that with classes.  Just prepare a class definition
file that looks like

	INTERNAL_WORD 1.0 surface_word
	etc.

and use it with the ngram -classes option.
The LM then needs to be in terms of internal words (i.e., word classes).
For training you need to prepare the data to contain internal words yourself,
but that shouldn't be a problem.
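
As a rough illustration of the two preparation steps above (writing the
class definition file and rewriting the training data in internal words),
here is a minimal Python sketch. The mapping, the internal word names, and
the function names are all hypothetical; a real mapping would come from
your own morphological analysis. It assumes a strictly one-to-one mapping
and leaves unmapped words unchanged, which may or may not suit your setup.

```python
# Hypothetical one-to-one mapping from surface words to internal words.
SURFACE_TO_INTERNAL = {
    "evler": "EV_PL",
    "evde": "EV_LOC",
}


def class_definition_lines(mapping):
    """Return 'INTERNAL_WORD 1.0 surface_word' lines, sorted by surface word."""
    # A probability of 1.0 per line only makes sense if each internal
    # word has exactly one surface word, so reject anything else.
    if len(set(mapping.values())) != len(mapping):
        raise ValueError("mapping is not one-to-one")
    return [f"{internal} 1.0 {surface}"
            for surface, internal in sorted(mapping.items())]


def to_internal(line, mapping):
    """Rewrite one line of training text in terms of internal words.

    Words without a mapping are passed through unchanged (an assumption).
    """
    return " ".join(mapping.get(word, word) for word in line.split())


if __name__ == "__main__":
    for definition in class_definition_lines(SURFACE_TO_INTERNAL):
        print(definition)
    print(to_internal("evler evde var", SURFACE_TO_INTERNAL))
```

The definition lines would go into the file given to ngram -classes, and
the rewritten text would be used to train the LM.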

Also, an internal word (i.e., class) can actually expand to a sequence of 
surface words (but not the other way round).
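
For the multiword case, the class definition line simply lists the whole
surface word sequence after the probability. The sketch below is only an
illustration under assumptions: the helper names and the NEW_YORK example
are made up, and the greedy left-to-right replacement is just one simple
way to prepare training data, not something the toolkit prescribes.

```python
def class_line(internal, surface_words, prob=1.0):
    """Format a class definition line whose expansion is a word sequence."""
    if not surface_words:
        raise ValueError("expansion must contain at least one surface word")
    return f"{internal} {prob} {' '.join(surface_words)}"


def collapse(tokens, surface_words, internal):
    """Replace each occurrence of surface_words in tokens by internal.

    Both tokens and surface_words are lists of strings. Matching is
    greedy and left to right.
    """
    out = []
    i = 0
    n = len(surface_words)
    while i < len(tokens):
        if tokens[i:i + n] == surface_words:
            out.append(internal)
            i += n
        else:
            out.append(tokens[i])
            i += 1
    return out


if __name__ == "__main__":
    print(class_line("NEW_YORK", ["new", "york"]))
    print(" ".join(collapse("i love new york".split(),
                            ["new", "york"], "NEW_YORK")))
```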

Hope this helps

--Andreas