A Focused Crawler in order to Get Semantic Web Resources (CSR)

Barbosa Santillán, Liliana Ibeth and Campos Quirarte, Juana Elizabeth and Castro Munguía, Aldo (2013). A Focused Crawler in order to Get Semantic Web Resources (CSR). In: "Workshop on Semantic Web, ENC 2013", 30 Oct - 1 Nov 2013, Michoacán, Mexico. ISBN 978-607-9343-23-1. pp. 114-120.

Description

Title: A Focused Crawler in order to Get Semantic Web Resources (CSR)
Author/s:
  • Barbosa Santillán, Liliana Ibeth
  • Campos Quirarte, Juana Elizabeth
  • Castro Munguía, Aldo
Item Type: Presentation at Congress or Conference (Other)
Event Title: Workshop on Semantic Web, ENC 2013
Event Dates: 30 Oct - 1 Nov 2013
Event Location: Michoacán, Mexico
Title of Book: Workshops Proceedings in the Mexican International Conference on Computer Science (ENC 2013)
Date: 2013
ISBN: 978-607-9343-23-1
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Inteligencia Artificial
Creative Commons Licenses: Attribution - No derivative works - Non commercial

Abstract

This paper presents a focused crawler for getting Semantic Web resources (CSR). Structured data on the web is available in formats such as Extensible Markup Language (XML), Resource Description Framework (RDF) and Web Ontology Language (OWL), which can be used for processing. One of the main challenges of searching for and downloading Semantic Web resources manually is that the task is very time-consuming. Our research work proposes a focused crawler that downloads these resources automatically and stores them on disk, building a collection that will be used for data processing. CSR consists of three layers: (a) the User Interface Layer, (b) the Focus Crawler Layer and (c) the Base Crawler Layer. CSR uses the Shark-Search method as its selection policy. CSR was evaluated in two experiments. The first started on December 15, 2012 at 7:11 am and ended on December 16, 2012 at 4:01, obtaining 448,123,537 bytes of data. CSR stopped by itself after analyzing 80,4375 seeds with unlimited depth. It retrieved 16,576 semantic resource files, of which 89% were RDF, 10% were XML and 1% were OWL. The second experiment was based on the Web Data Commons work of the Research Group Data and Web Science at the University of Mannheim and the Institute AIFB at the Karlsruhe Institute of Technology. It began at 4:46 am on June 2, 2013 and ended at 1:37 am on June 9, 2013. After 162.51 hours of execution the result was 285,279 semantic resources, predominantly XML (99%), with OWL and RDF at 1% each.
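
The abstract names a Shark-Search selection policy and a collection split into RDF, XML and OWL files, but gives no implementation details. The Python sketch below is only an illustration of how such a crawler might prioritise links; it is not the authors' code. It assumes a simplified Shark-Search-style score (an inherited score decayed by `delta`, blended with an anchor-text score through `gamma`), a crude term-overlap `similarity` in place of a proper text-similarity measure, and file-extension matching to recognise semantic resources. The names `crawl`, `resource_type`, `similarity`, `TOY_WEB` and `toy_fetch`, the parameter values, and the example.org URLs are all hypothetical. Where the real crawler stores resources on disk, the sketch only records them in a list.

```python
import heapq
from collections import Counter
from urllib.parse import urlparse

# Assumption: semantic resources are recognised by file extension only.
SEMANTIC_EXTENSIONS = {".rdf": "RDF", ".owl": "OWL", ".xml": "XML"}


def resource_type(url):
    """Return 'RDF', 'OWL' or 'XML' if the URL path has a known extension, else None."""
    path = urlparse(url).path.lower()
    for ext, kind in SEMANTIC_EXTENSIONS.items():
        if path.endswith(ext):
            return kind
    return None


def similarity(topic_terms, text):
    """Fraction of topic terms found in the text (a crude stand-in for a real similarity measure)."""
    if not topic_terms:
        return 0.0
    words = set(text.lower().split())
    return sum(1 for term in topic_terms if term in words) / len(topic_terms)


def crawl(seeds, fetch, topic_terms, delta=0.5, gamma=0.8, max_pages=100):
    """Best-first crawl; fetch(url) must return (page_text, [(link_url, anchor_text), ...])."""
    # heapq is a min-heap, so priorities are stored negated.
    frontier = [(-1.0, url, 1.0) for url in seeds]  # (-score, url, inherited score)
    heapq.heapify(frontier)
    seen = set(seeds)
    found = []
    visited = 0
    while frontier and visited < max_pages:
        _, url, inherited = heapq.heappop(frontier)
        visited += 1
        kind = resource_type(url)
        if kind is not None:
            found.append((url, kind))  # a real crawler would download and store the file here
            continue
        text, links = fetch(url)
        page_sim = similarity(topic_terms, text)
        # Relevant pages pass on their own score; irrelevant ones pass on a decayed inherited score.
        child_inherited = delta * (page_sim if page_sim > 0 else inherited)
        for link, anchor in links:
            if link in seen:
                continue
            seen.add(link)
            score = gamma * child_inherited + (1 - gamma) * similarity(topic_terms, anchor)
            heapq.heappush(frontier, (-score, link, child_inherited))
    return found


# Hypothetical in-memory "web" so the sketch runs without network access.
TOY_WEB = {
    "http://example.org/": ("semantic web ontology portal", [
        ("http://example.org/data", "rdf ontology data"),
        ("http://example.org/news", "sports news"),
    ]),
    "http://example.org/data": ("rdf ontology files", [
        ("http://example.org/a.rdf", "rdf dump"),
        ("http://example.org/b.owl", "ontology"),
    ]),
    "http://example.org/news": ("sports", [
        ("http://example.org/feed.xml", "feed"),
    ]),
}


def toy_fetch(url):
    """Look a page up in TOY_WEB; unknown URLs yield an empty page with no links."""
    return TOY_WEB.get(url, ("", []))


if __name__ == "__main__":
    results = crawl(["http://example.org/"], toy_fetch, ["rdf", "ontology"])
    print(Counter(kind for _, kind in results))
```

In this toy run the on-topic `data` page is expanded before the off-topic `news` page, so its RDF and OWL links are reached first. A per-format tally like the `Counter` at the end is the kind of breakdown the abstract reports as percentages.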

More information

Item ID: 36867
DC Identifier: http://oa.upm.es/36867/
OAI Identifier: oai:oa.upm.es:36867
Official URL: http://computo.fismat.umich.mx/enc2013/new/index.php/accepted-papers/accepted-papers-workshop-web
Deposited by: Memoria Investigacion
Deposited on: 10 Sep 2015 09:52
Last Modified: 10 Sep 2015 09:52