Text mining to detect indications of fraud in annual reports worldwide

Marcia Valentine Maria Fissette

Research output: Thesis / PhD Thesis - Research UT, graduation UT / Academic

Abstract

The research described in this thesis examined the contribution of text analysis to detecting indications of fraud in the annual reports of companies worldwide. A total of 1,727 annual reports were collected, of which 402 concern companies and years in which fraudulent activities took place that affected the information disclosed in the annual report. A method for the automatic extraction of information from annual reports was proposed to obtain the data needed to compile this data set. The same approach was used to extract the Management Discussion & Analysis (MD&A) section, the part of the annual report on which this research focuses. The first models developed analyze the texts by counting the words (unigrams). This representation of the text is the input to the machine learning algorithms Naive Bayes (NB) and Support Vector Machine (SVM), which learn patterns to classify the annual reports as 'fraud' or 'no fraud'. Subsequently, the unigram-based NB and SVM models were extended with linguistic features that previous research on detecting fraud or deception in text found to be informative. We subdivided the linguistic features into six categories: consecutive words (bigrams), descriptive, complexity, grammatical, readability, psychological and grammatical relations. Finally, a Convolutional Neural Network (CNN) model was applied. In this model, each word in the text is represented by a vector (word embedding). Word embeddings aim to capture the semantic relationships between words, in addition to representing the individual words. The results show that text analysis can contribute to the detection of indications of fraud. However, it is desirable to further enhance the performance of the models.
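The unigram step described in the abstract can be illustrated with a minimal sketch. It assumes whitespace tokenization, a multinomial Naive Bayes with add-one smoothing, and a handful of invented toy sentences; the class name `UnigramNaiveBayes`, the helper `unigrams`, and the example texts are hypothetical and are not taken from the thesis, whose actual preprocessing and model settings are not given here.

```python
import math
from collections import Counter


def unigrams(text):
    """Lowercase the text and split on whitespace to get word tokens."""
    return text.lower().split()


class UnigramNaiveBayes:
    """Multinomial Naive Bayes over unigram counts with add-one smoothing."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.doc_counts = Counter(labels)
        # One word-count table per class, built from the unigram counts.
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(unigrams(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        self.total_docs = len(labels)
        return self

    def log_scores(self, text):
        """Return the unnormalized log posterior of each class for a text."""
        scores = {}
        for c in self.classes:
            total = sum(self.word_counts[c].values())
            score = math.log(self.doc_counts[c] / self.total_docs)
            for word in unigrams(text):
                if word not in self.vocab:
                    continue  # ignore words never seen in training
                smoothed = self.word_counts[c][word] + 1
                score += math.log(smoothed / (total + len(self.vocab)))
            scores[c] = score
        return scores

    def predict(self, text):
        scores = self.log_scores(text)
        return max(scores, key=scores.get)


# Invented toy sentences, NOT taken from the thesis data set.
train_texts = [
    "revenue growth exceeded expectations despite unusual adjustments",
    "restated earnings and unusual adjustments to revenue recognition",
    "steady revenue and stable operating costs this year",
    "operating costs were stable and cash flow was steady",
]
train_labels = ["fraud", "fraud", "no fraud", "no fraud"]

model = UnigramNaiveBayes().fit(train_texts, train_labels)
prediction = model.predict("unusual adjustments to earnings")
```

On real MD&A texts one would more likely rely on an established library implementation and evaluate on held-out reports; the sketch only shows how word counts become the input of a classifier that separates 'fraud' from 'no fraud'.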
Original language: English
Awarding Institution
  • University of Twente
Supervisors/Advisors
  • Veldkamp, Bernard P., Supervisor
  • de Vries, Theo, Supervisor
Award date: 21 Dec 2017
Place of Publication: Enschede
Publisher: University of Twente
Print ISBNs: 978-90-365-4420-7
DOIs: 10.3990/1.9789036544207
Publication status: Published - 21 Dec 2017

Fingerprint

thesis
analysis
detection
support vector machine
method
machine learning

Cite this

Fissette, Marcia Valentine Maria. / Text mining to detect indications of fraud in annual reports worldwide. Enschede : University of Twente, 2017. 161 p.
@phdthesis{c73d908095e1433e9af05e7caf273cf6,
title = "Text mining to detect indications of fraud in annual reports worldwide",
author = "Fissette, {Marcia Valentine Maria}",
year = "2017",
month = "12",
day = "21",
doi = "10.3990/1.9789036544207",
language = "English",
isbn = "978-90-365-4420-7",
publisher = "University of Twente",
address = "Netherlands",
school = "University of Twente",

}

TY - THES

T1 - Text mining to detect indications of fraud in annual reports worldwide

AU - Fissette, Marcia Valentine Maria

PY - 2017/12/21

Y1 - 2017/12/21

U2 - 10.3990/1.9789036544207

DO - 10.3990/1.9789036544207

M3 - PhD Thesis - Research UT, graduation UT

SN - 978-90-365-4420-7

PB - University of Twente

CY - Enschede

ER -