Efficient Cross-Domain Classification of Weblogs

Elisabeth Lex, Christin Seifert, Michael Granitzer, Andreas Juffinger

    Research output: Contribution to journal › Article › Academic › peer-review


    Abstract

    Text classification is a core application of data mining because of the huge amount of uncategorized textual data available. Training a text classifier yields a classification model that reflects the characteristics of the domain it was learned on. However, if no training data is available, labeled data from a related but different domain can be exploited to perform cross-domain classification. In our work, we aim to accurately classify unlabeled weblogs into commonly agreed-upon newspaper categories using labeled data from the news domain. Both the labeled news corpus and the unlabeled blog corpus are highly dynamic, growing hourly and exhibiting topic drift, so the classification needs to be efficient. Our approach is to apply a fast, novel centroid-based text classification algorithm, the Class-Feature-Centroid Classifier (CFC), to perform efficient cross-domain classification. Experiments showed that this algorithm achieves accuracy comparable to that of k-Nearest Neighbour (k-NN) and Support Vector Machines (SVM), yet at linear time cost for training and classification. We investigate the classifier's performance and generalization ability using a special visualization of classifiers. The benefit of our approach is that the linear time complexity enables us to efficiently generate an accurate classifier that reflects the topic drift, several times per day, on a huge dataset.
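
The idea of training on labeled news and classifying unlabeled blog posts with a centroid-based model can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy data, and the exact term weighting (an exponential of the in-class document frequency multiplied by a log inverse class frequency, with base `b`) are assumptions made for illustration and may differ from the CFC definition used in the paper. The sketch shows why such a classifier is cheap: training is one pass over the labeled tokens, and classification is one dot product per class.

```python
import math
from collections import Counter, defaultdict


def train_centroids(docs, labels, b=math.e):
    """Build one sparse centroid per class in a single pass over the corpus.

    docs   -- list of token lists (here: labeled news articles)
    labels -- list of class labels, aligned with docs
    b      -- base of the exponential weight (assumed value, not from the source)
    """
    class_sizes = Counter(labels)
    # df[c][t]: number of documents of class c that contain term t
    df = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        df[label].update(set(tokens))
    # cf[t]: number of classes in which term t occurs at least once
    cf = Counter(t for c in df for t in df[c])
    n_classes = len(class_sizes)
    centroids = {}
    for c, term_counts in df.items():
        # Assumed weighting: terms frequent inside a class and rare across
        # classes get a high weight; terms present in every class get zero.
        centroids[c] = {
            t: (b ** (n / class_sizes[c])) * math.log(n_classes / cf[t])
            for t, n in term_counts.items()
        }
    return centroids


def classify(tokens, centroids):
    """Return the class whose centroid scores highest against the document."""
    tf = Counter(tokens)
    norm = math.sqrt(sum(v * v for v in tf.values())) or 1.0

    def score(centroid):
        # Dot product of the length-normalized document with the centroid;
        # the centroid itself is deliberately left unnormalized.
        return sum(tf[t] * centroid.get(t, 0.0) for t in tf) / norm

    return max(centroids, key=lambda c: score(centroids[c]))


# Toy source domain: labeled news articles (hypothetical data).
news_docs = [
    ["match", "goal", "team"],
    ["team", "coach", "goal"],
    ["election", "vote", "party"],
    ["party", "minister", "vote"],
]
news_labels = ["sports", "sports", "politics", "politics"]

centroids = train_centroids(news_docs, news_labels)

# Toy target domain: an unlabeled blog post (hypothetical data).
blog_post = ["our", "team", "scored", "a", "late", "goal"]
print(classify(blog_post, centroids))
```

Because training only accumulates counts, the centroids can be rebuilt from scratch whenever the news corpus grows, which is the property the abstract relies on to follow topic drift.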
    Original language: English
    Pages (from-to): 69-76
    Number of pages: 8
    Journal: International Journal of Intelligent Computing Research
    Volume: 1
    Issue number: 3
    Publication status: Published - 2010
