Updating a spring data extraction and analysis pipeline

González Díez, Virginia (2025). Updating a spring data extraction and analysis pipeline. Thesis (Master's), E.T.S. de Ingenieros Informáticos (UPM).

Description

Title: Updating a spring data extraction and analysis pipeline
Author(s):
  • González Díez, Virginia
Supervisor(s):
Document type: Thesis (Master's)
Master's degree title: Ingeniería del Software
Date: June 2025
Subjects:
SDGs:
School: E.T.S. de Ingenieros Informáticos (UPM)
Department: Lenguajes y Sistemas Informáticos e Ingeniería del Software
Creative Commons licenses: Attribution - No Derivative Works - Non-Commercial

Full text

TFM_VIRGINIA_GONZALEZ_DIEZ.pdf: PDF (Portable Document Format), 6MB

Abstract

Understanding human diseases and discovering new therapeutic solutions remain central challenges in biomedical research. Diseases are not caused by isolated factors, but by complex networks of genes, proteins, and phenotypes interacting in dynamic systems. Traditional research methods, which often analyse elements in isolation, are insufficient for capturing this complexity. In response, the field of network medicine has emerged, using systems-based models to explore disease relationships and support drug repurposing (DR), a promising strategy that identifies new uses for existing drugs. Within this context, the DISNET (DISease understanding and drug repurposing through complex NETworks) platform was developed to construct a multilayer biomedical knowledge base. It integrates data from structured databases and unstructured online sources, including Wikipedia and Mayo Clinic, using Natural Language Processing (NLP) techniques (specifically, the MetaMap tool) to extract and validate phenotypic information. This phenotypic layer is then connected with biological and pharmacological layers, enabling researchers to explore disease relationships and DR hypotheses through a public API and web interface. However, despite its scientific relevance, by early 2023 the DISNET system was facing technical issues. Its core medical text extraction pipelines had stopped functioning, and the platform relied on outdated technologies, including Java 8 and early Spring Boot versions. The microservices architecture, with 16 Docker containers requiring independent configuration, had become unmanageable for the current small development team of one or two active developers. Additionally, the web interface was partially non-functional, and the system’s documentation was obsolete. These issues compromised the platform’s reliability and functionality. This Master’s Thesis addresses these problems through the restoration and improvement of the DISNET platform.
The primary goals were to restore the system’s core functionality, reduce technical complexity, and improve maintainability. Specific objectives included reactivating the Medical Term Extraction Process (MTEP) pipelines for Wikipedia and Mayo Clinic, upgrading the software stack, consolidating the system’s databases, and redesigning the web interface. An in-depth analysis of the DISNET architecture and design was conducted to understand the system’s complexity and to develop an improvement plan. The results of this work include a fully operational and simplified DISNET system. The extraction workflows now run reliably on the unified infrastructure, the number of containers has been reduced for ease of deployment, and the web interface has been improved with updated content. These improvements not only ensure that the system can be maintained by a small development team but also restore its role as a valuable, open-access research tool. With accurate and up-to-date phenotypic data, DISNET once again enables researchers to explore disease similarities, track knowledge evolution over time, and generate novel DR hypotheses.

More information

Record ID: 90343
DC identifier: https://oa.upm.es/90343/
OAI identifier: oai:oa.upm.es:90343
Deposited by: Biblioteca Facultad de Informatica
Deposited on: 31 Jul 2025 19:46
Last modified: 31 Jul 2025 19:46