Knowledge-Graph-Based Semantic Labeling of Tabular Data

Alobaid, Ahmad (2020). Knowledge-Graph-Based Semantic Labeling of Tabular Data. Thesis (Doctoral), E.T.S. de Ingenieros Informáticos (UPM). https://doi.org/10.20868/UPM.thesis.64068.

Description

Title: Knowledge-Graph-Based Semantic Labeling of Tabular Data
Author/s:
  • Alobaid, Ahmad
Contributor/s:
  • Corcho García, Óscar
Item Type: Thesis (Doctoral)
Date: February 2020
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Inteligencia Artificial
Creative Commons Licenses: Attribution - No derivative works - Non commercial

Full text

PDF (1MB) - Access restricted to users on the UPM campus until 24 March 2021

Abstract

Large amounts of data are published on the Web in tabular formats (e.g., spreadsheets). This is especially true of the data made available in open data portals by public and private institutions. However, one of the main challenges for their effective (re)use is their general lack of semantics: column names are not usually standardized, and their meaning and content are not always clear. In parallel, knowledge graphs have been widely adopted by some data providers as a means to publish large amounts of structured data. They commonly use graph-based formats (e.g., RDF) and refer to lightweight ontologies. It is well understood that the reuse of such tabular data may be improved by annotating them with the classes and properties used by the data available in knowledge graphs. Semantic labeling faces several challenges, such as common or duplicated entity names, differences in measurements and rounding errors in numeric values, and noise in both published tabular data and knowledge graphs. In this work, we present a novel approach to automatically label columns in tabular data with ontology classes and properties referred to by existing knowledge graphs. We evaluated our approach on entity columns and numeric columns separately. For the entity columns, we applied our approach to annotated tables from the T2D gold standard. For the numeric columns, we manually annotated numeric columns in the T2D gold standard and then applied our technique to these data. We report performance using precision, recall, and F1 scores, the conventional metrics for semantic labeling in the literature. The experiments showed that our approach successfully labeled the majority of the entity and numeric columns in the dataset used. In contrast with other proposals in the state of the art, our approach does not require external linguistic resources, other sources of information, or a human in the loop.
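
As an illustration of the metrics named in the abstract, the following minimal Python sketch computes precision, recall, and F1 for column labels scored against a gold standard. It is not taken from the thesis: the function name, the column identifiers, the class labels, and the counting convention (a wrong label counts as a false positive, a missing label as a false negative) are all assumptions made for the example.

```python
from typing import Dict, Optional, Tuple


def precision_recall_f1(
    gold: Dict[str, str],
    predicted: Dict[str, Optional[str]],
) -> Tuple[float, float, float]:
    """Score predicted column labels against gold labels.

    Assumed convention (hypothetical, other conventions exist):
    - true positive: predicted label equals the gold label
    - false positive: a label was predicted but it is wrong
    - false negative: no label was predicted for a gold-annotated column
    """
    tp = fp = fn = 0
    for column, gold_label in gold.items():
        label = predicted.get(column)
        if label is None:
            fn += 1
        elif label == gold_label:
            tp += 1
        else:
            fp += 1

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    return precision, recall, f1


# Hypothetical gold annotations: column id -> ontology class.
gold_labels = {
    "t1.c0": "dbo:City",
    "t2.c0": "dbo:Film",
    "t3.c0": "dbo:Person",
    "t4.c0": "dbo:Country",
    "t5.c0": "dbo:Mountain",
}

# Hypothetical system output: two correct, two wrong, one unlabeled.
predicted_labels = {
    "t1.c0": "dbo:City",
    "t2.c0": "dbo:Film",
    "t3.c0": "dbo:Athlete",
    "t4.c0": "dbo:Place",
    "t5.c0": None,
}

p, r, f1 = precision_recall_f1(gold_labels, predicted_labels)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With these made-up inputs there are 2 true positives, 2 false positives, and 1 false negative, so precision is 2/4, recall is 2/3, and F1 is 4/7.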

More information

Item ID: 64068
DC Identifier: http://oa.upm.es/64068/
OAI Identifier: oai:oa.upm.es:64068
DOI: 10.20868/UPM.thesis.64068
Deposited by: Archivo Digital UPM 2
Deposited on: 28 Sep 2020 06:15
Last Modified: 26 Oct 2020 10:22