Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment

Dong-Phuong Nguyen, Rudolf Berend Trieschnigg, A. Seza Dogruoz, Rilana Gravel, Mariet Theune, Theo Meder, Franciska M.G. de Jong

Abstract

There is a growing interest in automatically predicting the gender and age of authors from texts. However, most research so far ignores that language use is related to the social identity of speakers, which may be different from their biological identity. In this paper, we combine insights from sociolinguistics with data collected through an online game, to underline the importance of approaching age and gender as social variables rather than static biological variables. In our game, thousands of players guessed the gender and age of Twitter users based on tweets alone. We show that more than 10% of the Twitter users do not employ language that the crowd associates with their biological sex. It is also shown that older Twitter users are often perceived to be younger. Our findings highlight the limitations of current approaches to gender and age prediction from texts.
Original language: Undefined
Title of host publication: Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014
Publisher: Association for Computational Linguistics
Pages: 1950-1961
Number of pages: 12
ISBN (Print): 978-1-941643-26-6
State: Published - 23 Aug 2014

Fingerprint

Prediction
Social identity

Keywords

  • EWI-25496
  • Twitter
  • natural language processing
  • Classification
  • METIS-309770
  • Crowdsourcing
  • Gender
  • IR-94100
  • Age

Cite this

Nguyen, D-P., Trieschnigg, R. B., Dogruoz, A. S., Gravel, R., Theune, M., Meder, T., & de Jong, F. M. G. (2014). Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment. In Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014 (pp. 1950-1961). Association for Computational Linguistics.

Nguyen, Dong-Phuong; Trieschnigg, Rudolf Berend; Dogruoz, A. Seza; Gravel, Rilana; Theune, Mariet; Meder, Theo; de Jong, Franciska M.G. / Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment.

Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014. Association for Computational Linguistics, 2014. p. 1950-1961.

Research output: Scientific - peer-review, Conference contribution

@inproceedings{dfc0be822b9742bba620f7878fedc81b,
title = "Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment",
abstract = "There is a growing interest in automatically predicting the gender and age of authors from texts. However, most research so far ignores that language use is related to the social identity of speakers, which may be different from their biological identity. In this paper, we combine insights from sociolinguistics with data collected through an online game, to underline the importance of approaching age and gender as social variables rather than static biological variables. In our game, thousands of players guessed the gender and age of Twitter users based on tweets alone. We show that more than 10% of the Twitter users do not employ language that the crowd associates with their biological sex. It is also shown that older Twitter users are often perceived to be younger. Our findings highlight the limitations of current approaches to gender and age prediction from texts.",
keywords = "EWI-25496, Twitter, natural language processing, Classification, METIS-309770, Crowdsourcing, Gender, IR-94100, Age",
author = "Dong-Phuong Nguyen and Trieschnigg, {Rudolf Berend} and Dogruoz, {A. Seza} and Rilana Gravel and Mariet Theune and Theo Meder and {de Jong}, {Franciska M.G.}",
year = "2014",
month = "8",
isbn = "978-1-941643-26-6",
publisher = "Association for Computational Linguistics",
pages = "1950--1961",
booktitle = "Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014",
}

Nguyen, D-P, Trieschnigg, RB, Dogruoz, AS, Gravel, R, Theune, M, Meder, T & de Jong, FMG 2014, Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment. in Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014. Association for Computational Linguistics, pp. 1950-1961.

TY  - CHAP
T1  - Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment
AU  - Nguyen, Dong-Phuong
AU  - Trieschnigg, Rudolf Berend
AU  - Dogruoz, A. Seza
AU  - Gravel, Rilana
AU  - Theune, Mariet
AU  - Meder, Theo
AU  - de Jong, Franciska M.G.
PY  - 2014/8/23
Y1  - 2014/8/23
N2  - There is a growing interest in automatically predicting the gender and age of authors from texts. However, most research so far ignores that language use is related to the social identity of speakers, which may be different from their biological identity. In this paper, we combine insights from sociolinguistics with data collected through an online game, to underline the importance of approaching age and gender as social variables rather than static biological variables. In our game, thousands of players guessed the gender and age of Twitter users based on tweets alone. We show that more than 10% of the Twitter users do not employ language that the crowd associates with their biological sex. It is also shown that older Twitter users are often perceived to be younger. Our findings highlight the limitations of current approaches to gender and age prediction from texts.
KW  - EWI-25496
KW  - Twitter
KW  - natural language processing
KW  - Classification
KW  - METIS-309770
KW  - Crowdsourcing
KW  - Gender
KW  - IR-94100
KW  - Age
M3  - Conference contribution
SN  - 978-1-941643-26-6
SP  - 1950
EP  - 1961
BT  - Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014
PB  - Association for Computational Linguistics
ER  -

Nguyen D-P, Trieschnigg RB, Dogruoz AS, Gravel R, Theune M, Meder T et al. Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment. In Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014. Association for Computational Linguistics. 2014. p. 1950-1961.