Abstract
In this Master's final project, CluStream, a machine learning algorithm for clustering, was applied in a data streaming context, with Apache Spark serving as the distributed platform for large-scale stream processing. The purpose of the project was to apply the CluStream algorithm to group data and to use Spark to distribute the processing close to where the data is generated. The project was divided into two phases. The first phase aimed to take simulated data, create a data stream from it, publish that stream to a Kafka streaming bus, and have Spark Streaming subscribe to the data and apply a clustering algorithm using Spark MLlib. The simulated data was stored in a database and queried frequently to simulate data being sent by real in-field sensors. Because clustering is an unsupervised technique, a synthetic dataset with known group assignments was used. The second phase aimed to present the clustered data graphically. It was also intended to republish the clustered data to the Kafka bus under a separate topic name, with an additional database subscribed to that topic in order to store the clustered data. A Node.js application was then created to listen for any data changes in that database and render them graphically. The idea behind this second phase was to provide a user-friendly online representation of the data being consumed and processed by the clustering algorithm.