Why Gender and Age Prediction from Tweets is Hard: Lessons from a Crowdsourcing Experiment

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution › Academic › peer-review

Abstract

There is a growing interest in automatically predicting the gender and age of authors from texts. However, most research so far ignores that language use is related to the social identity of speakers, which may be different from their biological identity. In this paper, we combine insights from sociolinguistics with data collected through an online game, to underline the importance of approaching age and gender as social variables rather than static biological variables. In our game, thousands of players guessed the gender and age of Twitter users based on tweets alone. We show that more than 10% of the Twitter users do not employ language that the crowd associates with their biological sex. It is also shown that older Twitter users are often perceived to be younger. Our findings highlight the limitations of current approaches to gender and age prediction from texts.
Original language: English
Title of host publication: Proceedings of the 25th International Conference on Computational Linguistics, COLING 2014
Publisher: Association for Computational Linguistics (ACL)
Pages: 1950-1961
Number of pages: 12
ISBN (Print): 978-1-941643-26-6
Publication status: Published - 23 Aug 2014
Event: 25th International Conference on Computational Linguistics, COLING 2014 - Dublin, Ireland
Duration: 23 Aug 2014 to 29 Aug 2014

Conference

Conference: 25th International Conference on Computational Linguistics, COLING 2014
Period: 23/08/14 to 29/08/14
Other: 23-29 August 2014

Keywords

  • Twitter
  • natural language processing
  • Classification
  • Crowdsourcing
  • Gender
  • Age
