Text classification is one of the core applications in data mining because of the huge amount of uncategorized textual data available. Training a text classifier yields a classification model that reflects the characteristics of the domain it was learned on. However, if no training data is available for the target domain, labeled data from a related but different domain can be exploited to perform cross-domain classification. In our work, we aim to accurately classify unlabeled weblogs into commonly agreed-upon newspaper categories using labeled data from the news domain. Both the labeled news corpus and the unlabeled blog corpus are highly dynamic: they grow hourly and exhibit topic drift, so classification needs to be efficient. Our approach is to apply a fast, novel centroid-based text classification algorithm, the Class-Feature-Centroid Classifier (CFC), to perform efficient cross-domain classification. Experiments showed that this algorithm achieves accuracy comparable to that of k-Nearest Neighbour (k-NN) and Support Vector Machines (SVM), yet at linear time cost for both training and classification. We investigate classifier performance and generalization ability using a special visualization of classifiers. The benefit of our approach is that the linear time complexity enables us to efficiently generate an accurate classifier that reflects the topic drift, several times per day, on a huge dataset.
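To illustrate why centroid-based classification can run in linear time, the following is a minimal sketch of a generic centroid classifier trained on labeled news snippets and applied to a blog snippet. It is not the CFC algorithm itself: the abstract does not specify the CFC term weighting, so plain term-frequency centroids with L2 normalization stand in for it here. The function names `train_centroids` and `classify` and the toy data are assumptions made for illustration only.

```python
from collections import Counter, defaultdict
import math


def train_centroids(labeled_docs):
    """Build one L2-normalized term-frequency centroid per class.

    Each document is visited once, so training is linear in the
    total number of tokens. (Assumption: plain term counts, not the
    CFC weighting, which the abstract does not describe.)
    """
    sums = defaultdict(Counter)
    for tokens, label in labeled_docs:
        sums[label].update(tokens)

    centroids = {}
    for label, counts in sums.items():
        norm = math.sqrt(sum(v * v for v in counts.values()))
        centroids[label] = {t: v / norm for t, v in counts.items()}
    return centroids


def classify(tokens, centroids):
    """Return the class whose centroid has the largest dot product
    with the document's term counts (linear in document length
    times the number of classes)."""
    counts = Counter(tokens)

    def score(centroid):
        return sum(c * centroid.get(t, 0.0) for t, c in counts.items())

    return max(centroids, key=lambda label: score(centroids[label]))


# Hypothetical labeled source domain (news) ...
news = [
    (["team", "wins", "match", "goal"], "sports"),
    (["striker", "scores", "goal", "match"], "sports"),
    (["parliament", "passes", "budget", "vote"], "politics"),
    (["minister", "wins", "election", "vote"], "politics"),
]
centroids = train_centroids(news)

# ... applied to an unlabeled target-domain document (a blog post).
blog_post = ["my", "team", "lost", "the", "match"]
print(classify(blog_post, centroids))
```

Because training is a single pass over the labeled corpus, a classifier of this kind can in principle be rebuilt whenever the news corpus grows, which is the property the abstract relies on to track topic drift.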
|Number of pages||8|
|Journal||International Journal of Intelligent Computing Research|
|Publication status||Published - 2010|