The research described in this thesis examined the contribution of text analysis to detecting indications of fraud in the annual reports of companies worldwide. A total of 1,727 annual reports were collected, of which 402 concern years and companies in which fraudulent activities took place that affected the information disclosed in the annual report. A method for automatically extracting information from annual reports was proposed to obtain the data needed to compile this data set. The same approach was used to extract the Management Discussion & Analysis (MD&A) section, the part of the annual report on which this research focuses. The first models developed analyze the texts by counting words (unigrams). This representation of the text is the input to the machine learning algorithms Naive Bayes (NB) and Support Vector Machine (SVM), which learn patterns to classify the annual reports as 'fraud' or 'no fraud'. Subsequently, the unigram-based NB and SVM models were extended with linguistic features of the text that previous research found to be informative for detecting fraud or deception in text. We subdivided the linguistic features into six categories: consecutive words (bigrams), descriptive, complexity, grammatical, readability, psychological and grammatical relations. Finally, a Convolutional Neural Network (CNN) model was applied. In this model, each word in the text is represented by a vector (word embedding). Word embeddings aim to capture the semantic relationships between words, in addition to representing the individual words. The results show that text analysis can contribute to the detection of indications of fraud. However, it is desirable to further enhance the performance of the models.
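The unigram baseline described above can be illustrated with a minimal sketch. It is not the thesis implementation: the tokenizer (lowercasing and whitespace splitting), the add-one smoothing, the function names `unigrams`, `train_nb` and `predict_nb`, and the four toy texts are all assumptions made for illustration, and the texts are invented rather than taken from the data set.

```python
import math
from collections import Counter


def unigrams(text):
    """Count lowercase, whitespace-separated tokens (assumed tokenizer)."""
    return Counter(text.lower().split())


def train_nb(docs, labels):
    """Collect per-class document and word counts for multinomial Naive Bayes."""
    class_counts = Counter(labels)
    word_counts = {label: Counter() for label in class_counts}
    for doc, label in zip(docs, labels):
        word_counts[label].update(unigrams(doc))
    vocab = set()
    for counts in word_counts.values():
        vocab.update(counts)
    return class_counts, word_counts, vocab


def predict_nb(model, text):
    """Return the class with the highest log-probability under the model."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    scores = {}
    for label in class_counts:
        total_words = sum(word_counts[label].values())
        score = math.log(class_counts[label] / total_docs)  # class prior
        for word, n in unigrams(text).items():
            if word not in vocab:
                continue  # skip words never seen during training
            # Add-one (Laplace) smoothing; an assumption of this sketch.
            p = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += n * math.log(p)
        scores[label] = score
    return max(scores, key=scores.get)


# Invented toy texts, for illustration only.
docs = [
    "revenue grew and cash flow improved",
    "steady growth in revenue and margins",
    "restated earnings after accounting irregularities",
    "investigation into irregularities and restated results",
]
labels = ["no fraud", "no fraud", "fraud", "fraud"]

model = train_nb(docs, labels)
prediction = predict_nb(model, "earnings restated after investigation")
```

With realistic data, the same word counts could instead be passed to an SVM, and the feature set could be widened with bigrams or other linguistic features as the abstract describes.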
|Award date||21 Dec 2017|
|Place of Publication||Enschede|
|Publication status||Published - 21 Dec 2017|