Abstract
The aim of this research is to study the feasibility of classifying DNA sequences using parameters obtained with mathematical tools for sequence analysis. For this purpose, we studied 200 DNA sequences collected from different databases, such as NCBI [7] and EMBL [1]. The first step was to convert the DNA sequences into time series using a method described by Peng et al. [9]. Once the time series were obtained, the methods described in Peng et al. [9] were used to perform a fluctuation analysis, which provides a parameter called α. The time series were also used to perform a multifractal detrended fluctuation analysis (MF-DFA) [6], from which h(q) values were obtained for q ∈ {−10, −9, ..., 9, 10} ∪ {±0.2} using polynomial fits of degrees 1, 2 and 3. After calculating the parameters α and h(q), we used them in hypothesis tests (Student's t-test, ANOVA, Tukey's test), chosen according to the characteristic we wanted to classify. The resulting p-values, together with the means of α and h(q), indicate which values could serve as classifiers. Finally, classification was carried out with two machine learning methods (k-means and neural networks). With each method, one classification used only the h(q) parameters and another used both α and h(q). The results of this research suggest that the DNA sequences cannot be classified using neural networks, because the error rates for all classifications are very high (the smallest is 0.18). There are two possible reasons for this: the database may not be large enough to train the classifier correctly, or there may not be enough parameters for the task. However, the hypothesis tests reveal significant differences between the parameters for the selected characteristics.
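As an illustration of the first two steps, the following is a minimal sketch of how a DNA sequence might be turned into a walk and how a fluctuation exponent α could be estimated from it. It assumes the purine/pyrimidine mapping (C or T gives +1, A or G gives −1) and a plain root-mean-square fluctuation F(l) fitted as F(l) ∝ l^α; the abstract does not state which variant was used, so both choices are assumptions. The function names `dna_walk`, `fluctuation` and `estimate_alpha` are hypothetical, and the sketch does not cover MF-DFA or the h(q) values.

```python
import numpy as np

def dna_walk(seq):
    """Cumulative walk of a DNA string.

    Assumed mapping: pyrimidine (C, T) -> +1, purine (A, G) -> -1.
    Characters other than A, C, G, T are skipped.
    """
    steps = np.array([1 if b in "CT" else -1 for b in seq.upper() if b in "ACGT"])
    return np.cumsum(steps)

def fluctuation(walk, l):
    """Standard deviation of the walk displacement over all windows of length l."""
    dy = walk[l:] - walk[:-l]
    return np.std(dy)

def estimate_alpha(walk, lengths):
    """Slope of log F(l) against log l, i.e. the exponent in F(l) ~ l**alpha."""
    F = np.array([fluctuation(walk, l) for l in lengths])
    slope, _intercept = np.polyfit(np.log(lengths), np.log(F), 1)
    return slope

# Usage on a synthetic, uncorrelated sequence (not data from the study).
rng = np.random.default_rng(0)
random_seq = "".join(rng.choice(list("ACGT"), size=20000))
alpha = estimate_alpha(dna_walk(random_seq), [4, 8, 16, 32, 64])
print(f"estimated alpha: {alpha:.2f}")
```

For an uncorrelated sequence the exponent is expected to lie near 0.5, so values that depart clearly from 0.5 are what would make α a candidate classification feature.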