Abstract
Original language | English |
---|---|
Pages (from-to) | 69-76 |
Number of pages | 8 |
Journal | International Journal of Intelligent Computing Research |
Volume | 1 |
Issue number | 3 |
DOIs | 10.20533/ijicr.2042.4655.2010.0007 |
Publication status | Published - 2010 |
Cite this
Efficient Cross-Domain Classification of Weblogs. / Lex, Elisabeth; Seifert, Christin; Granitzer, Michael; Juffinger, Andreas.
In: International Journal of Intelligent Computing Research, Vol. 1, No. 3, 2010, p. 69-76.
Research output: Contribution to journal › Article › Academic › peer-review
TY - JOUR
T1 - Efficient Cross-Domain Classification of Weblogs
AU - Lex, Elisabeth
AU - Seifert, Christin
AU - Granitzer, Michael
AU - Juffinger, Andreas
PY - 2010
Y1 - 2010
N2 - Text classification is one of the core applications in data mining due to the huge amount of uncategorized textual data available. Training a text classifier results in a classification model that reflects the characteristics of the domain it was learned on. However, if no training data is available, labeled data from a related but different domain might be exploited to perform cross-domain classification. In our work, we aim to accurately classify unlabeled weblogs into commonly agreed-upon newspaper categories using labeled data from the news domain. The labeled news corpus and the unlabeled blog corpus are highly dynamic, growing hourly and subject to topic drift, so the classification needs to be efficient. Our approach is to apply a fast, novel centroid-based text classification algorithm, the Class-Feature-Centroid Classifier (CFC), to perform efficient cross-domain classification. Experiments showed that this algorithm achieves accuracy comparable to that of k-Nearest Neighbour (k-NN) and Support Vector Machines (SVM), yet at linear time cost for training and classification. We investigate the classifier performance and generalization ability using a special visualization of classifiers. The benefit of our approach is that the linear time complexity enables us to efficiently generate an accurate classifier, reflecting the topic drift, several times per day on a huge dataset.
UR - http://infonomics-society.org/ijicr/
U2 - 10.20533/ijicr.2042.4655.2010.0007
DO - 10.20533/ijicr.2042.4655.2010.0007
M3 - Article
VL - 1
SP - 69
EP - 76
JO - International Journal of Intelligent Computing Research
JF - International Journal of Intelligent Computing Research
SN - 2042-4655
IS - 3
ER -
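
The abstract describes a centroid-based classifier trained on labeled news and applied to unlabeled weblogs at linear time cost. The record does not give the CFC weighting scheme, so the following is only a minimal sketch of the general centroid idea, not the authors' algorithm: it assumes plain term-frequency centroids and cosine similarity, and the names `train_centroids` and `classify` as well as the toy data are hypothetical.

```python
from collections import Counter, defaultdict
import math


def train_centroids(labeled_docs):
    """Build one L2-normalized term-count centroid per class in a single pass."""
    sums = defaultdict(Counter)
    for tokens, label in labeled_docs:
        sums[label].update(tokens)  # accumulate term counts per class
    centroids = {}
    for label, counts in sums.items():
        norm = math.sqrt(sum(v * v for v in counts.values()))
        # A class with no terms gets an empty centroid (similarity 0 to everything).
        centroids[label] = {t: v / norm for t, v in counts.items()} if norm else {}
    return centroids


def classify(tokens, centroids):
    """Return the class whose centroid is most cosine-similar to the document."""
    doc = Counter(tokens)
    doc_norm = math.sqrt(sum(v * v for v in doc.values()))
    if doc_norm == 0:
        return None  # empty document: no basis for a decision

    def score(centroid):
        # Centroids are unit length, so only the document norm is divided out.
        return sum(v * centroid.get(t, 0.0) for t, v in doc.items()) / doc_norm

    return max(centroids, key=lambda label: score(centroids[label]))


# Toy "news" training data (assumed, for illustration only).
news = [
    ("election vote parliament minister".split(), "politics"),
    ("match goal team league".split(), "sports"),
    ("vote minister budget".split(), "politics"),
]
centroids = train_centroids(news)

# An unlabeled "blog" post classified with the news-trained centroids.
print(classify("my team scored a late goal".split(), centroids))
```

Training touches each token once and classification compares a document against one vector per class, which is why this family of classifiers scales linearly and can be retrained frequently as topics drift.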