Active semi-supervised clustering with uncertain constraints

López Pantoja, Pablo (2020). Active semi-supervised clustering with uncertain constraints. Thesis (Master thesis), E.T.S. de Ingenieros Informáticos (UPM).


Title: Active semi-supervised clustering with uncertain constraints
  • López Pantoja, Pablo
  • Zanardini, Damiano
  • Blockeel, Hendrik
Item Type: Thesis (Master thesis)
Masters title: Ingeniería Informática
Date: July 2020
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Inteligencia Artificial
Creative Commons Licenses: Recognition - No derivative works - Non commercial

Full text

PDF - Requires a PDF viewer, such as GSview, Xpdf or Adobe Acrobat Reader
Download (2MB) | Preview


Clustering is the task of finding groups of elements that are highly similar. It is one of the key tools in data mining to extract knowledge from huge amounts of data. However, dierent clusterings may exist for the same dataset. In unsupervised clustering, the user has to find the appropriate combination of algorithm, hyperparameters and similarity metric to obtain the desired clustering. Semi-supervised clustering solves this by using a small amount of supervision provided by the user to bias the result towards one that fulfills the user’s expectation. We consider supervision in the form of constraints, which indicate whether two instances should belong to the same cluster. Typically, semi-supervised algorithms require the supervision a priori, whereas, in active semi-supervised clustering, the algorithm actively queries the user (oracle) for constraints. The algorithm queries the user: "Do instances x and y belong to the same cluster?". If the user answers "Yes", the constraint between the instances is a must-link, whereas, if he answers "No", the constraint is a cannot-link. However, a user may be not sure about the relation between two instances or he may not be able to get an answer in a short time. Thus, forcing him to answer may introduce erroneous constraints in the algorithm that impact its performance. Allowing a user to answer "I don’t know" to a constraint query is not considered in most previous work in this field. In this thesis, we present the non-deterministic local distance-weak oracle, an oracle model that assumes two situations in which a human oracle would be uncertain about the relation between two instances. Then, we present a baseline method to introduce "Don’t know" answers into COBRAS making minimal changes to the algorithm. This method results in a better user experience for the oracle in exchange for a loss in clustering quality. Finally, we present two methods that try to predict whether a constraint is a must-link or a cannot-link when the oracle answers "Don’t know" to its corresponding query. The first method predicts based on constraint similarity, considering the norm and cosine similarity between the uncertain constraint and a reference constraint. The second method makes predictions by training a random forest model with previously queried constraints.

More information

Item ID: 63706
DC Identifier:
OAI Identifier:
Deposited by: Biblioteca Facultad de Informatica
Deposited on: 14 Sep 2020 05:30
Last Modified: 14 Sep 2020 05:30
  • Logo InvestigaM (UPM)
  • Logo GEOUP4
  • Logo Open Access
  • Open Access
  • Logo Sherpa/Romeo
    Check whether the anglo-saxon journal in which you have published an article allows you to also publish it under open access.
  • Logo Dulcinea
    Check whether the spanish journal in which you have published an article allows you to also publish it under open access.
  • Logo de Recolecta
  • Logo del Observatorio I+D+i UPM
  • Logo de OpenCourseWare UPM