Text Categorisation using SRILM package

Tue Apr 8 06:08:19 PDT 2008

Dear SRILM users,

I would like your opinion on using the SRILM package for text
categorisation. My goal is to compare, on some well-known data sets
(newsgroup, Reuters, ...) and on other data sets, the classification
performance of the SRILM package against other well-known techniques
(SVMs, decision trees, ...) that are giving good results.
The only problem I'm facing is that the SRILM package is very large,
and I would be embarrassed if a "wrong" configuration of the package
on my side distorted the results. So I am submitting the methodology
I plan to use, in order to get your advice, suggestions and corrections.

Each data set (pre-processed with stop-word removal and stemming) has a
number of categories. Each document belongs to exactly one category
(multi-class, single-label). For each category I build a trainingFile
containing all the training documents of that category. Then, for each
category, I obtain a model file using the following command:
	ngram-count  -text trainingFile -lm modelFile
I'm using 10-fold cross-validation to guard against over-fitting, so
each trainingFile consists of 90% of the documents.
The model obtained is tested on the remaining 10% with the following command:
			ngram -lm modelFile -ppl testFile -debug 0
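
To make the 90%/10% split concrete, here is a minimal Python sketch of
10-fold cross-validation. The function name kfold_splits and the seed
argument are made up for illustration and are not part of SRILM. The
sketch assumes the documents of one category are held in a list; applying
it per category keeps the category proportions the same in every fold.

```python
import random


def kfold_splits(documents, k=10, seed=0):
    """Yield (train, test) lists so that each document is tested exactly once.

    Hypothetical helper, not an SRILM tool: it only decides which documents
    go into trainingFile and which go into testFile for each fold.
    """
    docs = list(documents)
    # Shuffle with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(docs)
    # Fold i takes every k-th document starting at position i.
    folds = [docs[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test


# Usage with 20 placeholder document ids: each fold tests 2 and trains on 18.
splits = list(kfold_splits(range(20), k=10))
```

For each fold, the train list would be written to trainingFile and passed
to ngram-count, and the test documents would be scored with ngram -ppl.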

The output gives me the perplexity as well as the logprob. I take the
logprob as the log-likelihood of the data, i.e. log P(document | category).
(Is it OK to use the logprob directly, or should I use the perplexity?
Since each category has its own vocabulary, could OOVs influence the
categorisation?)
For the categorisation I'm using Bayes' rule:
P(category | document) = P(document | category) * P(category) / P(document).
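
To use the logprob in Bayes' rule it first has to be pulled out of the
ngram -ppl summary. The sketch below is one way to do that in Python. It
assumes the summary has the usual two-line shape ("... sentences, ...
words, ... OOVs" followed by "... zeroprobs, logprob= ... ppl= ..."); that
layout should be checked against the installed SRILM version. The sample
text is made up for illustration and is not real output. The OOV count is
returned as well, so it can be inspected when comparing categories.

```python
import re

# Assumed layout of the `ngram -ppl` summary at -debug 0.
SUMMARY_RE = re.compile(
    r"(\d+) sentences, (\d+) words, (\d+) OOVs\s+"
    r"(\d+) zeroprobs, logprob= (\S+)"
)


def parse_ppl_summary(text):
    """Extract the counts and the logprob from an ngram -ppl summary."""
    match = SUMMARY_RE.search(text)
    if match is None:
        raise ValueError("no ngram -ppl summary found in the given text")
    sentences, words, oovs, zeroprobs = (int(g) for g in match.groups()[:4])
    return {
        "sentences": sentences,
        "words": words,
        "oovs": oovs,
        "zeroprobs": zeroprobs,
        "logprob": float(match.group(5)),
    }


# Made-up sample text in the assumed format, for illustration only.
sample = (
    "file testFile: 2 sentences, 10 words, 1 OOVs\n"
    "0 zeroprobs, logprob= -23.4567 ppl= 135.8 ppl1= 403.2\n"
)
summary = parse_ppl_summary(sample)
```

To classify individual documents, each test document would need its own
summary from each category model, rather than one summary for the whole
testFile.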

Since P(document) is the same for all categories, I obtain a score
proportional to the posterior probability simply as
P(document | category) * P(category). I'm estimating the prior as the
proportion of all documents that belong to that category.
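
The prior estimate described above can be sketched in a few lines of
Python. The function name log10_priors is hypothetical. Base-10 logs are
used on the assumption that the SRILM logprob is also base 10, so that the
two quantities can be added directly; the labels are assumed to come from
the training portion of each fold only.

```python
import math
from collections import Counter


def log10_priors(training_labels):
    """Return log10 P(category), estimated as the share of training documents."""
    counts = Counter(training_labels)
    total = sum(counts.values())
    return {category: math.log10(n / total) for category, n in counts.items()}


# Usage with made-up labels: half of the documents are in category "a".
priors = log10_priors(["a", "a", "b", "c"])
```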

Finally, I classify a document into the category with the maximum
posterior probability P(category | document).
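
In practice the maximum is easier to take in the log domain, because the
probability of a whole document is far too small to represent as a plain
floating-point number. A minimal Python sketch follows; the function name
classify and the example numbers are made up for illustration, and both
inputs are assumed to be base-10 logs.

```python
import math


def classify(log10_likelihoods, log10_priors):
    """Return the category maximising log P(document | c) + log P(c)."""
    scores = {
        category: log10_likelihoods[category] + log10_priors[category]
        for category in log10_likelihoods
    }
    best = max(scores, key=scores.get)
    return best, scores


# Made-up numbers: one document scored against two category models.
likelihoods = {"sport": -120.0, "politics": -118.5}
priors = {"sport": math.log10(0.7), "politics": math.log10(0.3)}
best, scores = classify(likelihoods, priors)
```

Here the larger prior of "sport" does not make up for its lower logprob,
so the document goes to "politics".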

In your view, is this simple test good enough for assessing the
classification performance of the SRILM package, or is it necessary to
use other commands or options to take other factors (such as OOVs, ...)
into account?

Thank you for your contribution. I hope this question will also help
other users later on.

@min.
