Which financial news to trust?: financial news article author assessment using sentiment analysis and named entity linking in a streaming environment

Raaijmakers, Boy Adrianus Jacobus (2019). Which financial news to trust?: financial news article author assessment using sentiment analysis and named entity linking in a streaming environment. Thesis (Master thesis), E.T.S. de Ingenieros Informáticos (UPM).

Description

Title: Which financial news to trust?: financial news article author assessment using sentiment analysis and named entity linking in a streaming environment
Author/s:
  • Raaijmakers, Boy Adrianus Jacobus
Contributor/s:
  • Rodríguez González, Alejandro
  • Mehdad, Ehsan
Item Type: Thesis (Master thesis)
Master's title: Data Science
Date: July 2019
Subjects:
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Lenguajes y Sistemas Informáticos e Ingeniería del Software
Creative Commons Licenses: Attribution - No derivative works - Non commercial

Abstract

The challenge tackled in this thesis is to combine sentiment analysis and named entity linking in a data stream environment, applied to the task of assessing the expertise of financial news article authors. To do this, existing streaming methodology in both fields is reviewed and its quality in this context is assessed. Using these insights, a novel stream-based sentiment model and two novel named entity recognition (NER) post-processing methods are introduced to improve on the current state of the art. The work reviews currently available lexicon methodology for sentiment analysis on unlabeled data and formulates a novel unsupervised method for lexicon-based sentiment analysis in data streams. The best-performing existing sentiment lexicon on the validation data is SentiWordNet 3, which achieves an accuracy of 0.616 and an F1-score of 0.730. A novel adaptive unsupervised sentiment analysis method (AdaUSA) is introduced. AdaUSA updates a baseline lexicon over a data stream by learning and forgetting information in either a tumbling or a sliding window paradigm. Using the optimal configuration of AdaUSA, the experiments show an accuracy of 0.666 and an F1-score of 0.757 on the validation data, a relative increase in accuracy of 8.1% over the SentiWordNet 3 baseline. Additionally, the work reviews current methodology in named entity recognition and named entity linking. Using our custom tagging algorithm, Stanford CoreNLP is the best-performing method, with a classification accuracy of 0.487 and an F1-score of 0.250. Combining this model with the novel post-processing ADom-selection and RDom-selection methods presented in this thesis improves the accuracy to at most 0.741 and 0.812, respectively. Finally, the tagged target companies are shown to be correctly linked to their corresponding semantic web entities using DBpedia Spotlight with an accuracy of at most 0.776. The insights from both tasks are combined into a stream-based architecture.
Sentiment analysis and named entity linking were implemented in both Apache Storm and Apache Flink, and the computational complexity of the two environments is compared. Apache Flink showed the fastest article throughput on the test system using the best-performing models on the validation data, with an average processing time of 436.22 seconds per 1000 articles (about 0.44 seconds per article). The system with the highest performance on the validation data is hence an Apache Flink environment using AdaUSA with the SentiWordNet 3 lexicon as baseline for sentiment analysis, Stanford CoreNLP with RDom-selection for named entity recognition, and DBpedia Spotlight for named entity linking. Deploying the full architecture against the validation set of authors, the work finds significant differences in author quality per industry and over time. Especially in assessing the sentiment of news articles, the results show a group of six well-performing authors and four poorly performing authors. Assessing the impact that the authors in the validation dataset have on the stock prices of the companies they write about, the study finds that the poorest-performing author has a score of 0.223, while the best-performing author has a score of 0.601.
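
The abstract describes AdaUSA as updating a baseline lexicon by learning and forgetting within a tumbling or sliding window, but it does not give the update rule. The following is only a minimal sketch of that general idea and not the thesis's algorithm: the function names, the `learn_rate` and `forget_rate` parameters, and the blend-and-decay update are all hypothetical and chosen purely for illustration.

```python
from collections import deque
from typing import Dict, Iterable, Iterator, List


def tumbling_windows(stream: Iterable[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive, non-overlapping windows of `size` items."""
    window: List[str] = []
    for item in stream:
        window.append(item)
        if len(window) == size:
            yield window
            window = []
    if window:  # emit the final, possibly shorter, window
        yield window


def sliding_windows(stream: Iterable[str], size: int, step: int = 1) -> Iterator[List[str]]:
    """Yield overlapping windows of `size` items, advancing by `step` items."""
    buffer: deque = deque(maxlen=size)
    for index, item in enumerate(stream, start=1):
        buffer.append(item)
        # Emit once the buffer is full, then again after every `step` new items.
        if index >= size and (index - size) % step == 0:
            yield list(buffer)


def update_lexicon(
    lexicon: Dict[str, float],
    window_scores: Dict[str, float],
    learn_rate: float = 0.5,    # hypothetical parameter, not from the thesis
    forget_rate: float = 0.25,  # hypothetical parameter, not from the thesis
) -> Dict[str, float]:
    """Hypothetical learn/forget step applied once per window.

    Words observed in the window move toward their window score (learning);
    words not observed decay toward a neutral score of 0 (forgetting).
    """
    updated: Dict[str, float] = {}
    for word in set(lexicon) | set(window_scores):
        old = lexicon.get(word, 0.0)
        if word in window_scores:
            updated[word] = (1 - learn_rate) * old + learn_rate * window_scores[word]
        else:
            updated[word] = (1 - forget_rate) * old
    return updated
```

In this sketch a tumbling window uses each article exactly once, while a sliding window reuses recent articles across overlapping windows. The sliding variant therefore updates the lexicon more often, at the cost of reprocessing the same articles.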

More information

Item ID: 56883
DC Identifier: http://oa.upm.es/56883/
OAI Identifier: oai:oa.upm.es:56883
Deposited by: Biblioteca Facultad de Informatica
Deposited on: 15 Oct 2019 11:26
Last Modified: 15 Oct 2019 11:26