Text Categorisation using SRILM package
Amin Mantrach
amantrac at ulb.ac.be
Tue Apr 8 06:08:19 PDT 2008
Dear SRILM users,
I would like your opinion on using the SRILM package for text
categorisation. My goal is to compare, on some well-known data sets
(newsgroups, Reuters, ...) and on other data sets, the classification
performance of the SRILM package with that of other well-known
techniques (SVMs, decision trees, ...) that give good results.
The only problem I'm facing is that the SRILM package is very large,
and I would be embarrassed if a wrong configuration on my part
distorted the results. So I am submitting the methodology I plan to
use, in order to get your advice, suggestions and corrections.
Each data set (pre-processed with stop-word removal and stemming) has a
number of categories. Each document belongs to exactly one category
(multi-class, single-label). For each category I build a trainingFile
containing all the documents of that category. Then, for that category,
I obtain a model file using the following command:
ngram-count -text trainingFile -lm modelFile
I'm using 10-fold cross-validation to guard against over-fitting,
so each trainingFile consists of 90% of the documents.
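
The 90%/10% split described above can be sketched as follows. This is
only an illustration under stated assumptions: the function name
k_fold_splits, the seed and the toy documents are hypothetical, and the
split is meant to be applied to the documents of one category at a
time, so that each fold's trainingFile is written from `train` and the
held-out documents come from `test`.

```python
import random

def k_fold_splits(documents, k=10, seed=0):
    """Yield (train_docs, test_docs) pairs for k-fold cross-validation."""
    docs = list(documents)
    # Shuffle once with a fixed seed so the folds are reproducible.
    random.Random(seed).shuffle(docs)
    # Fold i takes every k-th document starting at position i.
    folds = [docs[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [d for j, fold in enumerate(folds) if j != i for d in fold]
        yield train, test

# Hypothetical category containing 20 one-line documents.
docs = [f"doc {n}" for n in range(20)]
splits = list(k_fold_splits(docs, k=10))
train, test = splits[0]
print(len(train), len(test))  # 18 2
```

Every document ends up in the test part of exactly one fold, so the
accuracy averaged over the 10 folds uses each document once.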
The model obtained is tested on the held-out 10% with the following command:
ngram -lm modelFile -ppl testFile -debug 0
The output gives me the perplexity as well as the logprob. I take
the logprob as the log-likelihood of the data, i.e. log P(document |
category).
(Is it OK to use the logprob directly, or should I use the perplexity?
Since each category has its own vocabulary, could OOVs influence the
categorisation?)
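
To make the logprob/perplexity question concrete, here is a minimal
sketch of how the two numbers are related. It assumes that the logprob
is in base 10, that OOV words are left out of the logprob (as happens
when the model has no unknown-word token), and that the perplexity is
normalised by words - OOVs - zeroprobs + sentences. The function name
and all the numbers are hypothetical, chosen only to show the effect.

```python
def perplexity(logprob10, words, sentences, oovs=0, zeroprobs=0):
    """Perplexity from a base-10 logprob, skipping OOVs and zeroprobs."""
    # One end-of-sentence token is scored per sentence.
    tokens = words - oovs - zeroprobs + sentences
    return 10 ** (-logprob10 / tokens)

# Same hypothetical document (12 words, 3 sentences) under two models.
# Model A knows every word; model B skips 6 of them as OOVs.
logprob_a, oovs_a = -30.0, 0
logprob_b, oovs_b = -20.0, 6

ppl_a = perplexity(logprob_a, words=12, sentences=3, oovs=oovs_a)
ppl_b = perplexity(logprob_b, words=12, sentences=3, oovs=oovs_b)

# The raw logprob prefers B, which scored only half of the words,
# while the per-token perplexity prefers A.
print(logprob_b > logprob_a, ppl_a < ppl_b)  # True True
```

Under these assumptions, raw logprobs are only comparable across
categories if every model scores the same tokens, for example by
training all category models over one shared vocabulary with an
unknown-word token.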
For the categorisation I'm using Bayes' rule: P(category |
document) = P(document | category) * P(category) / P(document).
Since P(document) is constant across categories, I obtain the
posterior probability, up to that constant, simply as
P(document | category) * P(category). I estimate the prior as the
proportion of all documents that belong to that category.
Finally, I classify a document into the category with the maximum
posterior probability P(category | document).
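
The decision rule above is easiest to apply in the log domain, where
the product becomes a sum. The following is a minimal sketch, assuming
base-10 logprobs, one per category model, for a single document; the
function name classify, the category names and all the numbers are
hypothetical.

```python
import math

def classify(logprobs10, doc_counts):
    """Return the category maximising log10 P(doc|cat) + log10 P(cat)."""
    total = sum(doc_counts.values())
    scores = {}
    for category, logprob in logprobs10.items():
        # Prior = share of training documents in this category.
        log_prior = math.log10(doc_counts[category] / total)
        scores[category] = logprob + log_prior
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical logprobs of one document under three category models.
logprobs = {"sport": -119.5, "politics": -118.9, "science": -131.2}
# Hypothetical number of training documents per category.
counts = {"sport": 500, "politics": 100, "science": 400}

best, scores = classify(logprobs, counts)
print(best)  # sport
```

In this toy case "politics" has the highest likelihood, but its small
prior lets "sport" win, which shows where the prior matters. Working
with sums of logs also avoids the underflow that multiplying very small
probabilities would cause.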
Do you think this simple test is good enough for assessing the
classification performance of the SRILM package, or is it necessary
to use other options to take other factors into account (such as
OOVs, ...)?
Thank you for your help. I hope this question will also help
other users later on.
@min.