TwitterNEED: a hybrid approach for named entity extraction and disambiguation for tweets

Mena Badieh Habib, Maurice van Keulen

Research output: Contribution to journal › Article › Academic › peer-review

Abstract

Twitter is a rich source of continuously and instantly updated information. Shortness and informality of tweets are challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high extraction recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates using features derived from the disambiguation phase in addition to other word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain a list of entity candidates from the YAGO Knowledge Base in addition to top-ranked pages from the Google search engine for each extracted mention. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. For evaluation, five data sets are used to evaluate the extraction approach, and three of them to evaluate both the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results compared to our competitors DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.
Original language: Undefined
Pages (from-to): 423-456
Number of pages: 34
Journal: Natural Language Engineering
Volume: 22
Issue number: 3
DOI: 10.1017/S1351324915000194
Publication status: Published - May 2016

Keywords

  • Twitter
  • tweets
  • microblogs
  • short messages
  • METIS-312605
  • IR-96451
  • Named Entity Disambiguation
  • Named Entity Extraction
  • Named Entity Linking
  • Named Entity Recognition
  • EWI-26014

Cite this

@article{cf41c80fe6324baeab4a50f6dd62bf0d,
title = "TwitterNEED: a hybrid approach for named entity extraction and disambiguation for tweets",
abstract = "Twitter is a rich source of continuously and instantly updated information. Shortness and informality of tweets are challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high extraction recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates using features derived from the disambiguation phase in addition to other word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain a list of entity candidates from the YAGO Knowledge Base in addition to top-ranked pages from the Google search engine for each extracted mention. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. For evaluation, five data sets are used to evaluate the extraction approach, and three of them to evaluate both the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results compared to our competitors DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.",
keywords = "Twitter, tweets, microblogs, short messages, METIS-312605, IR-96451, Named Entity Disambiguation, Named Entity Extraction, Named Entity Linking, Named Entity Recognition, EWI-26014",
author = "Habib, {Mena Badieh} and {van Keulen}, Maurice",
note = "eemcs-eprint-26014",
year = "2016",
month = "5",
doi = "10.1017/S1351324915000194",
language = "Undefined",
volume = "22",
pages = "423--456",
journal = "Natural Language Engineering",
issn = "1351-3249",
publisher = "Cambridge University Press",
number = "3",
}

TwitterNEED: a hybrid approach for named entity extraction and disambiguation for tweets. / Habib, Mena Badieh; van Keulen, Maurice.

In: Natural Language Engineering, Vol. 22, No. 3, 05.2016, p. 423-456.


TY - JOUR

T1 - TwitterNEED: a hybrid approach for named entity extraction and disambiguation for tweets

AU - Habib, Mena Badieh

AU - van Keulen, Maurice

N1 - eemcs-eprint-26014

PY - 2016/5

Y1 - 2016/5

N2 - Twitter is a rich source of continuously and instantly updated information. Shortness and informality of tweets are challenges for Natural Language Processing tasks. In this paper, we present TwitterNEED, a hybrid approach for Named Entity Extraction and Named Entity Disambiguation for tweets. We believe that disambiguation can help to improve the extraction process. This mimics the way humans understand language and reduces error propagation in the whole system. Our extraction approach aims for high extraction recall first, after which a Support Vector Machine attempts to filter out false positives among the extracted candidates using features derived from the disambiguation phase in addition to other word shape and Knowledge Base features. For Named Entity Disambiguation, we obtain a list of entity candidates from the YAGO Knowledge Base in addition to top-ranked pages from the Google search engine for each extracted mention. We use a Support Vector Machine to rank the candidate pages according to a set of URL and context similarity features. For evaluation, five data sets are used to evaluate the extraction approach, and three of them to evaluate both the disambiguation approach and the combined extraction and disambiguation approach. Experiments show better results compared to our competitors DBpedia Spotlight, Stanford Named Entity Recognition, and the AIDA disambiguation system.

KW - Twitter

KW - tweets

KW - microblogs

KW - short messages

KW - METIS-312605

KW - IR-96451

KW - Named Entity Disambiguation

KW - Named Entity Extraction

KW - Named Entity Linking

KW - Named Entity Recognition

KW - EWI-26014

U2 - 10.1017/S1351324915000194

DO - 10.1017/S1351324915000194

M3 - Article

VL - 22

SP - 423

EP - 456

JO - Natural Language Engineering

JF - Natural Language Engineering

SN - 1351-3249

IS - 3

ER -