Named entity extraction and disambiguation for informal text: the missing link

Abstract

Social media content represents a large portion of all textual content appearing on the Internet. These streams of user-generated content (UGC) offer media analysts both an opportunity and a challenge: to analyze huge amounts of new data and use them to infer and reason with new information. One major application area of social media analysis is customer feedback. With so many feedback channels available, organizations can mix and match them to best suit corporate needs and customer preferences. Another sector that stands to benefit is social security: automatic monitoring and gathering of information posted on social media can help in taking action to prevent violent and destructive behavior.
A main challenge of natural language is its ambiguity and vagueness. To resolve ambiguity automatically, computers use the grammatical structure of sentences, for instance which groups of words go together (phrases) and which words are the subject or object of a verb. However, the informal language widely used in social media is more ambiguous and thus more challenging for automatic understanding. Information Extraction (IE) is the research field that enables the use of unstructured text in a structured way. Named Entity Extraction (NEE) is a subtask of IE that aims to locate phrases (mentions) in the text that represent names of entities such as persons, organizations or locations, regardless of their type. Named Entity Disambiguation (NED) is the task of determining which person, place, event, etc. a mention refers to. The main goal of this thesis is to mimic the human way of recognizing and disambiguating named entities, especially for domains that lack formal sentence structure. The proposed methods open the door to more sophisticated applications based on users' contributions on social media.
We propose a robust combined framework for NEE and NED in semi-formal and informal text. The achieved robustness has been proven to hold across languages and domains and to be independent of the selected extraction and disambiguation techniques. It is also shown to be robust against a shortage of labeled training data and against the informality of the language used. We have discovered a reinforcement effect and exploited it in a technique that improves extraction quality by feeding back disambiguation results. We present a method of handling the uncertainty involved in extraction to improve the disambiguation results. A generic approach for NED in tweets for any named entity (that is, not entity-oriented) is presented. This approach overcomes the problem of the limited coverage of knowledge bases (KBs). Mentions are disambiguated by assigning them to either a Wikipedia article or a home page. We also introduce a method to enrich the limited entity context.
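The reinforcement idea mentioned in the abstract, where disambiguation results are fed back to improve extraction, can be illustrated with a small toy sketch. This is not the thesis's implementation: the knowledge base `TOY_KB`, the capitalization heuristic, the equal 0.5/0.5 weighting, and all function names are assumptions made purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical toy knowledge base: surface form -> (entity, prior score).
# All entries and scores are invented for illustration.
TOY_KB = {
    "paris": [("Paris_(France)", 0.9), ("Paris_Hilton", 0.1)],
    "twente": [("University_of_Twente", 0.8)],
}


@dataclass
class Mention:
    text: str
    extraction_score: float       # confidence of the high-recall extractor
    entity: Optional[str] = None  # best KB candidate, if any
    link_score: float = 0.0       # disambiguation evidence for that candidate


def extract_candidates(tweet):
    """High-recall extraction: every token is a candidate mention.

    Capitalized tokens get a higher (still uncertain) score.
    """
    mentions = []
    for token in tweet.split():
        word = token.strip(".,!?#@")
        if not word:
            continue
        score = 0.6 if word[0].isupper() else 0.3
        mentions.append(Mention(word, score))
    return mentions


def disambiguate(mention):
    """Attach the best-scoring KB candidate to the mention, if one exists."""
    candidates = TOY_KB.get(mention.text.lower(), [])
    if candidates:
        mention.entity, mention.link_score = max(candidates, key=lambda c: c[1])
    return mention


def extract_and_link(tweet, threshold=0.5):
    """Keep a candidate only if extraction and disambiguation together support it."""
    kept = []
    for m in map(disambiguate, extract_candidates(tweet)):
        # Feedback step: disambiguation evidence revises the extraction decision.
        combined = 0.5 * m.extraction_score + 0.5 * m.link_score
        if combined >= threshold:
            kept.append((m.text, m.entity))
    return kept


print(extract_and_link("Loving paris today"))
```

In this sketch the lowercase token "paris" is weak extraction evidence on its own but is kept because disambiguation finds a strong candidate, while the capitalized "Loving" is dropped because nothing in the toy knowledge base supports it.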
Original language: English
Awarding Institution
  • University of Twente
Supervisors/Advisors
  • Apers, Peter M.G., Supervisor
  • van Keulen, Maurice, Advisor
Date of Award: 9 May 2014
Place of Publication: Enschede
Print ISBNs: 978-90-365-3647-9
DOIs
  • 10.3990/1.9789036536479
State: Published - 9 May 2014

Fingerprint

Information extraction
Field research
Monitoring
Reinforcement
Vagueness
Social security
Informality
Wikipedia
Media analysis
Uncertainty
Robustness

Keywords

  • Named entity extraction
  • Named entity disambiguation
  • Named entity recognition
  • Named entity linking
  • Twitter
  • Tweets
  • Microblogs
  • Uncertainty
  • METIS-303481
  • EWI-24790
  • IR-90645

Cite this

@misc{e6b7ee33500b4b10845c8e2c2a316666,
title = "Named entity extraction and disambiguation for informal text: the missing link",
keywords = "Named entity extraction, Named entity disambiguation, Named entity recognition, Named entity linking, Twitter, Tweets, Microblogs, Uncertainty, METIS-303481, EWI-24790, IR-90645",
author = "Habib, {Mena Badieh}",
note = "SIKS Dissertation Series No. 2014-20",
year = "2014",
month = "5",
doi = "10.3990/1.9789036536479",
isbn = "978-90-365-3647-9",
school = "University of Twente",

}

Named entity extraction and disambiguation for informal text: the missing link. / Habib, Mena Badieh.

Enschede, 2014. 220 p.

Research output: Scientific › PhD Thesis - Research UT, graduation UT

TY - THES

T1 - Named entity extraction and disambiguation for informal text: the missing link

AU - Habib, Mena Badieh

N1 - SIKS Dissertation Series No. 2014-20

PY - 2014/5/9

Y1 - 2014/5/9

KW - Named entity extraction

KW - Named entity disambiguation

KW - Named entity recognition

KW - Named entity linking

KW - Twitter

KW - Tweets

KW - Microblogs

KW - Uncertainty

KW - METIS-303481

KW - EWI-24790

KW - IR-90645

U2 - 10.3990/1.9789036536479

DO - 10.3990/1.9789036536479

M3 - PhD Thesis - Research UT, graduation UT

SN - 978-90-365-3647-9

ER -