Efficient Cross-Domain Classification of Weblogs

Elisabeth Lex, Christin Seifert, Michael Granitzer, Andreas Juffinger

    Research output: Contribution to journal › Article › Academic › peer-review

    Abstract

    Text classification is one of the core applications in data mining due to the huge amount of uncategorized textual data available. Training a text classifier results in a classification model that reflects the characteristics of the domain it was learned on. However, if no training data is available for a domain, labeled data from a related but different domain can be exploited to perform cross-domain classification. In our work, we aim to accurately classify unlabeled weblogs into commonly agreed-upon newspaper categories using labeled data from the news domain. The labeled news corpus and the unlabeled blog corpus are highly dynamic, growing hourly and subject to topic drift, so the classification needs to be efficient. Our approach is to apply a fast, novel centroid-based text classification algorithm, the Class-Feature-Centroid Classifier (CFC), to perform efficient cross-domain classification. Experiments showed that this algorithm achieves accuracy comparable to that of k-Nearest Neighbour (k-NN) and Support Vector Machines (SVM), yet at linear time cost for training and classification. We investigate the classifier performance and generalization ability using a special visualization of classifiers. The benefit of our approach is that the linear time complexity enables us to efficiently generate an accurate classifier, reflecting the topic drift, several times per day on a huge dataset.
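The abstract does not spell out the CFC weighting scheme, so the following is only a minimal sketch of the general centroid-based idea it builds on, not the authors' algorithm. It assumes plain term-frequency centroids, cosine similarity, and a naive whitespace tokenizer; the function names and the toy "news" and "blog" texts are hypothetical. Training is a single pass over the labeled tokens, and classifying a document touches each of its terms once per class, which is where the linear time cost of this family of classifiers comes from.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()


def train_centroids(labeled_docs):
    """Build one L2-normalized term-frequency centroid per class.

    One pass over all tokens, so the cost is linear in the corpus size.
    NOTE: plain term frequencies are an assumption; CFC uses its own weights.
    """
    sums = defaultdict(Counter)
    for text, label in labeled_docs:
        sums[label].update(tokenize(text))
    centroids = {}
    for label, counts in sums.items():
        norm = math.sqrt(sum(v * v for v in counts.values()))
        centroids[label] = {t: v / norm for t, v in counts.items()}
    return centroids


def classify(text, centroids):
    """Return the class whose centroid has the highest cosine similarity."""
    tf = Counter(tokenize(text))
    doc_norm = math.sqrt(sum(v * v for v in tf.values())) or 1.0

    def score(centroid):
        return sum(v * centroid.get(t, 0.0) for t, v in tf.items()) / doc_norm

    return max(centroids, key=lambda label: score(centroids[label]))


# Hypothetical toy data: labeled "news" items as the source domain,
# one unlabeled "blog" post as the target domain.
news = [
    ("the team won the match after a late goal", "sports"),
    ("striker scores twice as the club wins the cup", "sports"),
    ("parliament passed the budget after a long debate", "politics"),
    ("the minister defended the new election law", "politics"),
]
centroids = train_centroids(news)
blog_post = "what a goal last night, the club totally deserved the cup"
predicted = classify(blog_post, centroids)
print(predicted)
```

Because the centroids are just per-class sums, retraining on a growing corpus amounts to adding the new documents' counts and renormalizing, which fits the setting of rebuilding the classifier several times per day.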
    Original language: English
    Pages (from-to): 69-76
    Number of pages: 8
    Journal: International Journal of Intelligent Computing Research
    Volume: 1
    Issue number: 3
    DOI: 10.20533/ijicr.2042.4655.2010.0007
    Publication status: Published - 2010

    Fingerprint

    Classifiers
    Blogs
    Support vector machines
    Data mining
    Visualization
    Costs
    Experiments

    Cite this

    Lex, Elisabeth ; Seifert, Christin ; Granitzer, Michael ; Juffinger, Andreas. / Efficient Cross-Domain Classification of Weblogs. In: International Journal of Intelligent Computing Research. 2010 ; Vol. 1, No. 3. pp. 69-76.
    @article{28785019c296467b88ccfecda3262689,
    title = "Efficient Cross-Domain Classification of Weblogs",
    author = "Elisabeth Lex and Christin Seifert and Michael Granitzer and Andreas Juffinger",
    year = "2010",
    doi = "10.20533/ijicr.2042.4655.2010.0007",
    language = "English",
    volume = "1",
    pages = "69--76",
    journal = "International Journal of Intelligent Computing Research",
    issn = "2042-4655",
    publisher = "Infonomics Society",
    number = "3",

    }

    TY - JOUR
    T1 - Efficient Cross-Domain Classification of Weblogs
    AU - Lex, Elisabeth
    AU - Seifert, Christin
    AU - Granitzer, Michael
    AU - Juffinger, Andreas
    PY - 2010
    Y1 - 2010
    UR - http://infonomics-society.org/ijicr/
    U2 - 10.20533/ijicr.2042.4655.2010.0007
    DO - 10.20533/ijicr.2042.4655.2010.0007
    M3 - Article
    VL - 1
    SP - 69
    EP - 76
    JO - International Journal of Intelligent Computing Research
    JF - International Journal of Intelligent Computing Research
    SN - 2042-4655
    IS - 3
    ER -