Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment

Dong-Phuong Nguyen, Rudolf Berend Trieschnigg, A. Seza Dogruoz, Rilana Gravel, Mariet Theune, Theo Meder, Franciska M.G. de Jong

    Research output: Chapter in Book/Report/Conference proceeding > Conference contribution > Academic > peer-review


    Abstract

    There is a growing interest in automatically predicting the gender and age of authors from texts. However, most research so far ignores that language use is related to the social identity of speakers, which may be different from their biological identity. In this paper, we combine insights from sociolinguistics with data collected through an online game, to underline the importance of approaching age and gender as social variables rather than static biological variables. In our game, thousands of players guessed the gender and age of Twitter users based on tweets alone. We show that more than 10% of the Twitter users do not employ language that the crowd associates with their biological sex. It is also shown that older Twitter users are often perceived to be younger. Our findings highlight the limitations of current approaches to gender and age prediction from texts.
    Original language: Undefined
    Title of host publication: Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014
    Publisher: Association for Computational Linguistics (ACL)
    Pages: 1950-1961
    Number of pages: 12
    ISBN (Print): 978-1-941643-26-6
    Publication status: Published - 23 Aug 2014

    Publication series

    Publisher: Association for Computational Linguistics

    Keywords

    • EWI-25496
    • Twitter
    • natural language processing
    • Classification
    • METIS-309770
    • Crowdsourcing
    • Gender
    • IR-94100
    • Age

    Cite this

    Nguyen, D-P., Trieschnigg, R. B., Dogruoz, A. S., Gravel, R., Theune, M., Meder, T., & de Jong, F. M. G. (2014). Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment. In Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014 (pp. 1950-1961). Association for Computational Linguistics (ACL).
    Nguyen, Dong-Phuong ; Trieschnigg, Rudolf Berend ; Dogruoz, A. Seza ; Gravel, Rilana ; Theune, Mariet ; Meder, Theo ; de Jong, Franciska M.G. / Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment. Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014. Association for Computational Linguistics (ACL), 2014. pp. 1950-1961
    @inproceedings{dfc0be822b9742bba620f7878fedc81b,
    title = "Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment",
    abstract = "There is a growing interest in automatically predicting the gender and age of authors from texts. However, most research so far ignores that language use is related to the social identity of speakers, which may be different from their biological identity. In this paper, we combine insights from sociolinguistics with data collected through an online game, to underline the importance of approaching age and gender as social variables rather than static biological variables. In our game, thousands of players guessed the gender and age of Twitter users based on tweets alone. We show that more than 10{\%} of the Twitter users do not employ language that the crowd associates with their biological sex. It is also shown that older Twitter users are often perceived to be younger. Our findings highlight the limitations of current approaches to gender and age prediction from texts.",
    keywords = "EWI-25496, Twitter, natural language processing, Classification, METIS-309770, Crowdsourcing, Gender, IR-94100, Age",
    author = "Dong-Phuong Nguyen and Trieschnigg, {Rudolf Berend} and Dogruoz, {A. Seza} and Rilana Gravel and Mariet Theune and Theo Meder and {de Jong}, {Franciska M.G.}",
    year = "2014",
    month = "8",
    day = "23",
    language = "Undefined",
    isbn = "978-1-941643-26-6",
    publisher = "Association for Computational Linguistics (ACL)",
    pages = "1950--1961",
    booktitle = "Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014",
    address = "United States",
    }

    Nguyen, D-P, Trieschnigg, RB, Dogruoz, AS, Gravel, R, Theune, M, Meder, T & de Jong, FMG 2014, Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment. in Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014. Association for Computational Linguistics (ACL), pp. 1950-1961.

    Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment. / Nguyen, Dong-Phuong; Trieschnigg, Rudolf Berend; Dogruoz, A. Seza; Gravel, Rilana; Theune, Mariet; Meder, Theo; de Jong, Franciska M.G.

    Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014. Association for Computational Linguistics (ACL), 2014. p. 1950-1961.


    TY - GEN

    T1 - Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment

    AU - Nguyen, Dong-Phuong

    AU - Trieschnigg, Rudolf Berend

    AU - Dogruoz, A. Seza

    AU - Gravel, Rilana

    AU - Theune, Mariet

    AU - Meder, Theo

    AU - de Jong, Franciska M.G.

    PY - 2014/8/23

    Y1 - 2014/8/23

    N2 - There is a growing interest in automatically predicting the gender and age of authors from texts. However, most research so far ignores that language use is related to the social identity of speakers, which may be different from their biological identity. In this paper, we combine insights from sociolinguistics with data collected through an online game, to underline the importance of approaching age and gender as social variables rather than static biological variables. In our game, thousands of players guessed the gender and age of Twitter users based on tweets alone. We show that more than 10% of the Twitter users do not employ language that the crowd associates with their biological sex. It is also shown that older Twitter users are often perceived to be younger. Our findings highlight the limitations of current approaches to gender and age prediction from texts.

    KW - EWI-25496

    KW - Twitter

    KW - natural language processing

    KW - Classification

    KW - METIS-309770

    KW - Crowdsourcing

    KW - Gender

    KW - IR-94100

    KW - Age

    M3 - Conference contribution

    SN - 978-1-941643-26-6

    SP - 1950

    EP - 1961

    BT - Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014

    PB - Association for Computational Linguistics (ACL)

    ER -

    Nguyen D-P, Trieschnigg RB, Dogruoz AS, Gravel R, Theune M, Meder T et al. Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment. In Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014. Association for Computational Linguistics (ACL). 2014. p. 1950-1961