Full text
PDF
- Requires a PDF viewer, such as GSview, Xpdf or Adobe Acrobat Reader
Download (7MB) |
García-Gutiérrez Espina, Miguel Ángel (2023). Study of dimensionality reduction techniques and interpretation of their coefficients, and influence on the learned models. Thesis (Master thesis), E.T.S. de Ingenieros Informáticos (UPM).
Title: | Study of dimensionality reduction techniques and interpretation of their coefficients, and influence on the learned models |
---|---|
Author/s: | García-Gutiérrez Espina, Miguel Ángel |
Contributor/s: | |
Item Type: | Thesis (Master thesis) |
Masters title: | Ciencia de Datos |
Date: | July 2023 |
Subjects: | |
Faculty: | E.T.S. de Ingenieros Informáticos (UPM) |
Department: | Inteligencia Artificial |
Creative Commons Licenses: | Attribution - No derivative works - Non commercial |
Dimensionality reduction transforms data from a high-dimensional space into a low-dimensional one that retains the essential information of the data. It aims to overcome the curse of dimensionality, that is, the challenges posed by high-dimensional data: increased computational complexity, a higher risk of overfitting and, especially, reduced explainability.
The loss of explainability is addressed by the field of Explainable Artificial Intelligence (XAI), which focuses on understanding machine learning models and explaining them in terms that humans can understand. Combining XAI and dimensionality reduction, this thesis presents a method for explaining principal components based on their correlations with the input features.
First, the dimensionality of the data is reduced using state-of-the-art dimensionality reduction techniques such as SLMVP. Each reduction technique is combined with different machine learning classifiers, and their parameters are fine-tuned to identify the configuration that achieves the highest accuracy on the given data. The accuracy obtained with only the first k components is measured for different values of k, and a recommendation is then given as to the number of components that should be kept.
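The accuracy-versus-k measurement can be illustrated with a minimal sketch. It is not the thesis's implementation: PCA (computed with an SVD) stands in for SLMVP, a nearest-centroid rule stands in for the tuned classifiers, and the synthetic data and all function names (`fit_pca`, `nearest_centroid_accuracy`, `accuracy_per_k`) are hypothetical.

```python
import numpy as np


def fit_pca(X):
    """Return the training mean and the principal directions (one per row)."""
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal directions sorted by decreasing variance.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, vt


def nearest_centroid_accuracy(Z_train, y_train, Z_test, y_test):
    """Accuracy of a nearest-centroid classifier (a deliberately simple stand-in)."""
    classes = np.unique(y_train)
    centroids = np.stack([Z_train[y_train == c].mean(axis=0) for c in classes])
    # distances[i, j] = distance from test point i to the centroid of class j.
    distances = np.linalg.norm(Z_test[:, None, :] - centroids[None, :, :], axis=2)
    predictions = classes[distances.argmin(axis=1)]
    return float(np.mean(predictions == y_test))


def accuracy_per_k(X_train, y_train, X_test, y_test, max_k):
    """Map each k in 1..max_k to the test accuracy using only the first k components."""
    mean, directions = fit_pca(X_train)
    Z_train = (X_train - mean) @ directions.T
    Z_test = (X_test - mean) @ directions.T
    return {
        k: nearest_centroid_accuracy(Z_train[:, :k], y_train, Z_test[:, :k], y_test)
        for k in range(1, max_k + 1)
    }


# Synthetic two-class data (hypothetical): only feature 0 separates the classes.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=200)
X = rng.normal(size=(200, 6))
X[:, 0] += 4.0 * y

scores = accuracy_per_k(X[:150], y[:150], X[150:], y[150:], max_k=6)
for k, acc in scores.items():
    print(f"k={k}: accuracy={acc:.2f}")
```

A plausible reading of the resulting table is to keep the smallest k beyond which the accuracy no longer improves noticeably; the exact criterion used in the thesis is not stated in this abstract.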
Second, the ability of each technique to capture and preserve the structure of the original dataset is analyzed by plotting its projections in 2- and 3-dimensional plots. We examine whether the data points are evenly distributed, which shows how effectively the technique has captured the overall variance of the dataset, and whether the plot exhibits a clear separation of the different classes. Together with the accuracy obtained in the previous classification task, this indicates the quality of the technique. Furthermore, we show that, among the supervised dimensionality reduction techniques evaluated, SLMVP is the only method capable of effectively handling multilabel datasets.
Finally, the correlations between the original data and each of the components obtained through dimensionality reduction are leveraged to extract meaningful qualitative information. This rests on the fact that the components are the directions of maximum variability of the data, so it is fair to assume that the dimensionality reduction technique assigns high significance to the variables that have a high absolute correlation with a component. A recommendation is then given as to which features should be selected for a subsequent machine learning task, based on their absolute correlation with the components.
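As a rough illustration of this correlation-based reading of components, the sketch below computes the Pearson correlation of every input feature with every component and ranks the features by absolute correlation. It rests on assumptions not taken from the thesis: PCA stands in for the evaluated techniques, the data are synthetic, the helper names are hypothetical, and ranking by the largest absolute correlation across components is only one possible selection rule.

```python
import numpy as np


def feature_component_correlations(X, Z):
    """Pearson correlation of every feature (column of X) with every component (column of Z).

    Returns an array of shape (n_features, n_components). Assumes no column is constant.
    """
    Xc = X - X.mean(axis=0)
    Zc = Z - Z.mean(axis=0)
    covariance = Xc.T @ Zc
    scale = np.outer(np.linalg.norm(Xc, axis=0), np.linalg.norm(Zc, axis=0))
    return covariance / scale


def rank_features(correlations):
    """Order feature indices by their largest absolute correlation with any component."""
    strength = np.abs(correlations).max(axis=1)
    return np.argsort(-strength)


# Hypothetical data: features 0 and 1 share a strong signal, features 2-4 are weak noise.
rng = np.random.default_rng(0)
signal = 5.0 * rng.normal(size=300)
X = 0.1 * rng.normal(size=(300, 5))
X[:, 0] = signal + rng.normal(size=300)
X[:, 1] = signal + rng.normal(size=300)

# PCA stand-in: project the centered data on the first two principal directions.
Xc = X - X.mean(axis=0)
_, _, vt = np.linalg.svd(Xc, full_matrices=False)
Z = Xc @ vt[:2].T

correlations = feature_component_correlations(X, Z)
ranking = rank_features(correlations)
print(np.round(correlations, 2))
print("features by decreasing relevance:", ranking.tolist())
```

In this toy setting, features 0 and 1 come out on top because they correlate strongly with the first component, which is the kind of qualitative statement the correlations are meant to support.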
In addition, the correlations are used to compare the similarity and dissimilarity of components produced by different techniques. This is done by computing the Spearman correlation coefficient between the absolute correlations of two components, which yields a similarity score. Observations are then made about which techniques are similar and which stand out as unique.
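One way to read this similarity score is sketched below, under the assumption that each component is described by its vector of correlations with the input features and that two such vectors are compared by rank. The function names and the example values are hypothetical, and the Spearman coefficient is computed by hand as the Pearson correlation of average ranks.

```python
import numpy as np


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    values = np.asarray(values, dtype=float)
    order = np.argsort(values, kind="stable")
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    for value in np.unique(values):
        tied = values == value
        ranks[tied] = ranks[tied].mean()
    return ranks


def component_similarity(corr_a, corr_b):
    """Spearman correlation between the absolute feature correlations of two components."""
    ranks_a = average_ranks(np.abs(corr_a))
    ranks_b = average_ranks(np.abs(corr_b))
    # Spearman's coefficient is the Pearson correlation of the ranks.
    return float(np.corrcoef(ranks_a, ranks_b)[0, 1])


# Hypothetical feature correlations of one component from each of two techniques.
component_a = np.array([0.90, -0.70, 0.20, 0.05])
component_b = np.array([-0.80, 0.60, 0.10, -0.30])

score = component_similarity(component_a, component_b)
print(round(score, 3))  # 0.8
```

Taking absolute values before ranking makes the score insensitive to the sign of a component, which is arbitrary for techniques such as PCA.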
The results indicate that SLMVP achieves a clearer separation of classes than the other tested techniques in both single-label and multilabel datasets. It also achieved the highest accuracy in 3 of the 4 datasets employed.
Item ID: | 75893 |
---|---|
DC Identifier: | https://oa.upm.es/75893/ |
OAI Identifier: | oai:oa.upm.es:75893 |
Deposited by: | Biblioteca Facultad de Informatica |
Deposited on: | 15 Sep 2023 09:51 |
Last Modified: | 15 Sep 2023 09:51 |