Efficient Clustering from Distributions over Topics

Badenes-Olmedo, Carlos and Redondo-García, José Luis and Corcho, Oscar (2017). Efficient Clustering from Distributions over Topics. In: "Knowledge Capture Conference (K-CAP 2017)", 04-06 Dec 2017, Austin, Texas, United States. pp. 1-8. https://doi.org/10.1145/3148011.3148019.


Title: Efficient Clustering from Distributions over Topics
  • Badenes-Olmedo, Carlos
  • Redondo-García, José Luis
  • Corcho, Oscar
Item Type: Presentation at Congress or Conference (Article)
Event Title: Knowledge Capture Conference (K-CAP 2017)
Event Dates: 04-06 Dec 2017
Event Location: Austin, Texas, United States
Title of Book: Proceedings of the Knowledge Capture Conference on - K-CAP 2017
Date: December 2017
Freetext Keywords: topic models; semantic similarity; large-scale text analysis; scholarly data
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Inteligencia Artificial
UPM's Research Group: Ontology Engineering Group OEG
Creative Commons Licenses: Attribution - ShareAlike



There are many scenarios where we may want to find pairs of textually similar documents in a large corpus (e.g. a researcher doing a literature review, or an R&D project manager analyzing project proposals). Discovering those connections programmatically can help experts achieve those goals, but brute-force pairwise comparison is computationally impractical when the document corpus is large. Some algorithms in the literature divide the search space into regions containing potentially similar documents, which are then processed separately from the rest in order to reduce the number of pairs compared. However, these unsupervised methods still incur high time costs. In this paper, we present an approach that relies on the results of a topic modeling algorithm over the documents in a collection as a means to identify smaller subsets of documents on which the similarity function can then be computed. This approach has obtained promising results when identifying similar documents in the domain of scientific publications. We have compared our approach against state-of-the-art clustering techniques and with different configurations of the topic modeling algorithm. Results suggest that our approach outperforms (> 0.5) the other analyzed techniques in terms of efficiency.
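
The general idea in the abstract (use per-document topic distributions to form small subsets, then compute similarity only inside each subset) can be illustrated with a minimal sketch. The abstract does not state the grouping rule or the similarity function, so everything specific here is an assumption: documents are bucketed by their single highest-weighted topic, similarity is taken as one minus the Jensen-Shannon divergence, and the names `group_by_top_topic`, `similar_pairs`, `threshold` and the sample distributions are hypothetical. It is not the authors' implementation.

```python
import math
from collections import defaultdict
from itertools import combinations


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2, range [0, 1]) of two distributions."""
    def kl(a, b):
        # Terms with a zero probability in `a` contribute nothing.
        return sum(x * math.log2(x / y) for x, y in zip(a, b) if x > 0)

    m = [(x + y) / 2 for x, y in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def group_by_top_topic(doc_topics):
    """Assumed grouping rule: bucket each document under its top topic."""
    groups = defaultdict(list)
    for doc_id, dist in doc_topics.items():
        top = max(range(len(dist)), key=dist.__getitem__)
        groups[top].append(doc_id)
    return dict(groups)


def similar_pairs(doc_topics, threshold=0.9):
    """Compare only documents sharing a bucket.

    Returns the pairs whose similarity reaches `threshold` (a hypothetical
    parameter) and the number of comparisons actually performed.
    """
    pairs, comparisons = [], 0
    for members in group_by_top_topic(doc_topics).values():
        for a, b in combinations(sorted(members), 2):
            comparisons += 1
            similarity = 1.0 - js_divergence(doc_topics[a], doc_topics[b])
            if similarity >= threshold:
                pairs.append((a, b))
    return pairs, comparisons


# Hypothetical topic distributions, as a topic model might produce them.
doc_topics = {
    "d1": [0.7, 0.2, 0.1],
    "d2": [0.6, 0.3, 0.1],
    "d3": [0.1, 0.8, 0.1],
    "d4": [0.2, 0.7, 0.1],
    "d5": [0.1, 0.1, 0.8],
}

pairs, comparisons = similar_pairs(doc_topics)
brute_force = len(doc_topics) * (len(doc_topics) - 1) // 2
print(f"{comparisons} comparisons instead of {brute_force}: {pairs}")
```

In this toy corpus the grouping step reduces the work from all ten document pairs to the two pairs that share a bucket. The trade-off of any such partitioning is that similar documents placed in different buckets are never compared.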

Funding Projects

Government of Spain | TIN2016-78011-C4-4-R | Unspecified | Unspecified | DATOS 4.0: RETOS Y SOLUCIONES

More information

Item ID: 52009
DC Identifier: https://oa.upm.es/52009/
OAI Identifier: oai:oa.upm.es:52009
DOI: 10.1145/3148011.3148019
Official URL: https://doi.org/10.1145/3148011.3148019
Deposited by: Carlos Badenes-Olmedo
Deposited on: 03 Sep 2018 10:43
Last Modified: 03 Sep 2018 10:43