Document cluster summaries and labeling

Gongolidis, Michail (2020). Document cluster summaries and labeling. Thesis (Master thesis), E.T.S. de Ingenieros Informáticos (UPM).


Title: Document cluster summaries and labeling
  • Gongolidis, Michail
  • Laki, Sandor
  • Menasalvas Ruíz, Ernestina
  • Botev, Victor
Item Type: Thesis (Master thesis)
Masters title: Data Science
Date: 2020
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Lenguajes y Sistemas Informáticos e Ingeniería del Software
Creative Commons Licenses: Attribution - No derivative works - Non commercial


Summarization is the task of abstracting key content from information sources. It has become vital in everyday life, given the exponentially growing amount of information generated on the internet. In natural language processing, automatic text summarization is the problem of producing short, accurate, and fluent text from large bodies of textual data. In this thesis, I test state-of-the-art deep learning models for summarizing single documents and document clusters. I then compare the resulting cluster summaries with principal cluster labeling techniques to determine how accurately the deep learning models can label document clusters. For single-document summarization, I chose a model consisting of multiple neural layers based on the encoder-decoder architecture, with a novel recurrent neural cell (Rotational Unit of Memory, RUM) at its core. I trained and evaluated the model on a new U.S. patent dataset. For document cluster summarization, I tested a framework that combines a fundamental extractive statistical natural language processing technique (Maximal Marginal Relevance, MMR) with the abstractive encoder-decoder model. Finally, I evaluated the resulting summaries by comparing them with the top words of each cluster, ranked by word frequency and by TF*IDF score. The RUM-based encoder-decoder model surpassed all the state-of-the-art models tested on the selected dataset, and the MMR framework successfully summarized document clusters of up to 100 U.S. patent abstracts.
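
To make the two classical techniques named in the abstract concrete, the following is a minimal Python sketch of Maximal Marginal Relevance selection and TF*IDF top-word labeling. It is an illustration under stated assumptions, not the implementation used in the thesis: the function names (`tfidf_vectors`, `mmr_select`, `top_terms`), the smoothed IDF formula, the use of the cluster centroid as the MMR query, and the default `lam` value are all choices made here for the example. The abstractive encoder-decoder step is omitted entirely.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tfidf_vectors(texts):
    """Return one sparse TF*IDF vector (term -> weight) per text."""
    counts = [Counter(tokenize(t)) for t in texts]
    n = len(texts)
    df = Counter(term for c in counts for term in c)
    # Assumption: smoothed IDF, so terms found in every text keep a weight.
    idf = {term: math.log((1 + n) / (1 + d)) + 1.0 for term, d in df.items()}
    return [{term: tf * idf[term] for term, tf in c.items()} for c in counts]


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(sentences, k=2, lam=0.7):
    """Greedily pick k sentence indices, trading relevance for diversity."""
    vecs = tfidf_vectors(sentences)
    # Assumption: the cluster centroid (sum of all vectors) acts as the query.
    centroid = Counter()
    for v in vecs:
        centroid.update(v)
    selected = []
    while len(selected) < min(k, len(sentences)):
        best, best_score = None, -math.inf
        for i, v in enumerate(vecs):
            if i in selected:
                continue
            relevance = cosine(v, centroid)
            redundancy = max((cosine(v, vecs[j]) for j in selected), default=0.0)
            # MMR score: reward relevance, penalize overlap with earlier picks.
            score = lam * relevance - (1 - lam) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
    return selected


def top_terms(texts, n=3):
    """Label a cluster with the n terms of highest summed TF*IDF weight."""
    totals = Counter()
    for v in tfidf_vectors(texts):
        totals.update(v)
    return [term for term, _ in totals.most_common(n)]


# Toy cluster (invented for illustration): two identical texts and one other.
cluster = [
    "battery cell with lithium anode",
    "battery cell with lithium anode",
    "cooling system for battery pack",
]
picked = mmr_select(cluster, k=2, lam=0.5)
label = top_terms(cluster, n=1)
```

With `lam=0.5` the second pick skips the duplicate text in favor of the less redundant one, while `lam=1.0` reduces MMR to plain relevance ranking and selects the duplicate. In a pipeline like the one described above, the selected sentences would then be passed to the abstractive model, and the top terms would serve as the reference labels for comparison.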

More information

Item ID: 64843
Deposited by: Biblioteca Facultad de Informatica
Deposited on: 20 Oct 2020 12:35
Last Modified: 20 Oct 2020 12:35