Dynamic topic modelling for exploring the scientific literature on coronavirus: an unsupervised labelling technique

Guillén Pacho, Ibai ORCID: https://orcid.org/0000-0001-7801-8815, Badenes Olmedo, Carlos ORCID: https://orcid.org/0000-0002-2753-9917 and Corcho, Oscar ORCID: https://orcid.org/0000-0002-9260-0753 (2024). Dynamic topic modelling for exploring the scientific literature on coronavirus: an unsupervised labelling technique. "International Journal of Data Science and Analytics"; ISSN 2364-415X. https://doi.org/10.1007/s41060-024-00610-0.

Description

Title: Dynamic topic modelling for exploring the scientific literature on coronavirus: an unsupervised labelling technique
Author(s): Guillén Pacho, Ibai; Badenes Olmedo, Carlos; Corcho, Oscar
Document Type: Article
Journal/Publication Title: International Journal of Data Science and Analytics
Date: 13 August 2024
ISSN: 2364-415X
Subjects:
SDGs:
Informal Keywords: CORD-1; CORD-19; Coronavirus; Coronaviruses; COVID-19; Dynamic topic model; Dynamic topic models; Interpretability; Labeling techniques; Labelings; Scientific Literature; Stem-Cell Transplantation; Tim; Topic interpretability; Topic labeling; Topic labelling; topic modeling
School: E.T.S. de Ingenieros Informáticos (UPM)
Department: Artificial Intelligence
Creative Commons Licenses: Attribution - No Derivative Works - Non-Commercial

Abstract

The work presented in this article focusses on improving the interpretability of probabilistic topic models created from a large collection of scientific documents that evolve over time. Several time-dependent approaches based on topic models were compared to analyse the annual evolution of latent concepts in the CORD-19 corpus: Dynamic Topic Model, Dynamic Embedded Topic Model, and BERTopic. The COVID-19 period (December 2019-present) was then analysed in greater depth, month by month, to explore the evolution of what is written about the disease. The evaluations suggest that the Dynamic Topic Model is the best choice for analysing the CORD-19 corpus. A novel topic labelling strategy is proposed for dynamic topic models to analyse the evolution of latent concepts. It incorporates content changes in both the annual evolution of the corpus and the monthly evolution of the COVID-19 disease. The generated labels are manually validated using two approaches: through the most relevant documents on the topic and through the documents that share the most semantically similar label topics. The labelling enables the interpretation of topics. The novel method for dynamic topic labelling fits the content of each topic and supports the semantics of the topics.

Associated projects

Type: Comunidad de Madrid
Code: PIPF-2022/COM-25947
Acronym: Not specified
Lead: Not specified
Title: Not specified

More information

Record ID: 88044
DC Identifier: https://oa.upm.es/88044/
OAI Identifier: oai:oa.upm.es:88044
Scientific Portal URL: https://portalcientifico.upm.es/es/ipublic/item/10243148
DOI Identifier: 10.1007/s41060-024-00610-0
Official URL: https://link.springer.com/article/10.1007/s41060-0...
Deposited by: iMarina Portal Científico
Deposited on: 26 Feb 2025 08:41
Last Modified: 26 Feb 2025 08:57