Abstract
The challenge tackled in this thesis is combining the methods of sentiment analysis and named entity linking in a data-stream environment, applied to the task of assessing the expertise of financial news article authors. To this end, existing streaming methodology in both fields is reviewed and its quality in this context is assessed. Using these insights, a novel stream-based sentiment model and two novel named entity recognition (NER) post-processing methods are introduced to improve on the current state of the art.

The work reviews currently available lexicon methodology for sentiment analysis on unlabeled data and formulates a novel unsupervised method for lexicon-based sentiment analysis in data streams. The best-performing existing sentiment lexicon on the validation data is SentiWordNet 3, achieving an accuracy of 0.616 and an F1-score of 0.730. A novel adaptive unsupervised sentiment analysis method (AdaUSA) is introduced. AdaUSA updates a baseline lexicon under a data stream by learning and forgetting information in either a tumbling- or sliding-window paradigm. Using the optimal configuration of AdaUSA, the experiments show an accuracy of 0.666 and an F1-score of 0.757 on the validation data, an 8.1% increase in performance over the SentiWordNet 3 baseline.

Additionally, the work reviews current methodology in named entity recognition and named entity linking. Using our custom tagging algorithm, Stanford CoreNLP is the best-performing method, with a classification accuracy of 0.487 and an F1-score of 0.250. Combining this model with the novel ADom-selection and RDom-selection post-processing methods presented in this thesis improves the accuracy to at most 0.741 and 0.812, respectively. Finally, the tagged target companies are shown to be correctly linked to their corresponding semantic web entities using DBPedia Spotlight with an accuracy of at most 0.776. The insights from both tasks are combined into a stream-based architecture.
Environments for sentiment analysis and named entity linking were implemented in both Apache Storm and Apache Flink, and their computational complexity is compared. Apache Flink showed the highest article throughput on the test system using the best-performing models on the validation data, with an average processing time of 436.22 seconds per 1000 articles. The system with the highest performance on the validation data is hence an Apache Flink environment using AdaUSA with the SentiWordNet 3 lexicon as baseline for sentiment analysis, Stanford CoreNLP with RDom-selection for named entity recognition, and DBPedia Spotlight for named entity linking. Deploying the full architecture on the validation set of authors, the work finds significant differences in author quality per industry and over time. In assessing the sentiment of news articles in particular, the results show a group of six well-performing authors and four poorly performing authors. Assessing the impact the authors in the validation dataset have on the stock prices of the companies they write about, the study finds that the poorest-performing author has a score of 0.223, while the best-performing author has a score of 0.601.