Comparative analysis of neural NLP models for information extraction from accounting documents

Buz, Tolga (2018). Comparative analysis of neural NLP models for information extraction from accounting documents. Thesis (Master thesis), E.T.S. de Ingenieros Informáticos (UPM).

Description

Title: Comparative analysis of neural NLP models for information extraction from accounting documents
Author/s:
  • Buz, Tolga
Contributor/s:
  • Möller, Sebastian
  • Küpper, Axel
Item Type: Thesis (Master thesis)
Masters title: Data Science
Date: 18 October 2018
Subjects:
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Other
Creative Commons Licenses: Attribution - No derivative works - Non commercial

Full text

TFM_TOLGA_BUZ.pdf (PDF, 1MB)

Abstract

Natural Language Processing has become highly important in research and business applications. State-of-the-art techniques are being used to automate tasks such as extracting relevant entities from documents or translating texts from one language to another. This thesis focuses on selecting models that have performed well on standard benchmarks for those tasks and adapting them to a new and specialised problem: the labelling of entities in invoice documents. For this purpose, five state-of-the-art neural network models are presented, applied and evaluated. The results show that four of the five selected models, based on recurrent and convolutional architectures, can be implemented successfully and perform similarly well on test documents, with average F1 scores of 68-71% at the word level and 67-69% at the entity level. A detailed error analysis reveals that low data quality and a suboptimal choice of labels due to the dataset’s origins are the main factors that influence the models’ performance. The thesis proposes a ranking of the five models with regard to their prediction performance as well as their cost and difficulty of implementation in order to answer the main research question. Possible improvements are proposed for future work, and the limitations of the project’s setting are explored and discussed. This project aims to contribute a different perspective to NER research by analysing and discussing errors and poor design choices in order to propose future improvements.
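
The abstract reports F1 at both the word level and the entity level. The following is a minimal sketch of how these two granularities can differ, not the evaluation procedure used in the thesis. It assumes one plain label per token with "O" marking tokens outside any entity, and treats a run of consecutive identical labels as one entity. The label names TOTAL and DATE and all function names are hypothetical.

```python
from typing import List, Set, Tuple


def f1(tp: int, fp: int, fn: int) -> float:
    """Harmonic mean of precision and recall; 0.0 when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def word_level_f1(gold: List[str], pred: List[str], outside: str = "O") -> float:
    """Micro-averaged F1 over single tokens, ignoring the outside label."""
    pairs = list(zip(gold, pred))
    tp = sum(1 for g, p in pairs if g == p and g != outside)
    fp = sum(1 for g, p in pairs if p != outside and g != p)
    fn = sum(1 for g, p in pairs if g != outside and g != p)
    return f1(tp, fp, fn)


def spans(labels: List[str], outside: str = "O") -> Set[Tuple[int, int, str]]:
    """Group consecutive identical non-outside labels into (start, end, label) spans."""
    found = set()
    start = None
    # A trailing outside label acts as a sentinel that closes any open span.
    for i, label in enumerate(labels + [outside]):
        if start is not None and label != labels[start]:
            found.add((start, i, labels[start]))
            start = None
        if start is None and label != outside:
            start = i
    return found


def entity_level_f1(gold: List[str], pred: List[str]) -> float:
    """F1 where a prediction counts only if the whole span and its label match."""
    gold_spans, pred_spans = spans(gold), spans(pred)
    tp = len(gold_spans & pred_spans)
    fp = len(pred_spans - gold_spans)
    fn = len(gold_spans - pred_spans)
    return f1(tp, fp, fn)


# Hypothetical invoice labels: the prediction misses the second TOTAL token.
gold = ["O", "TOTAL", "TOTAL", "O", "DATE"]
pred = ["O", "TOTAL", "O", "O", "DATE"]
word_score = word_level_f1(gold, pred)
entity_score = entity_level_f1(gold, pred)
```

In this toy case, one missed token costs a single false negative at the word level, but at the entity level the partially matched TOTAL span counts as both a false positive and a false negative. Under these assumptions, entity-level scores tend to be equal to or lower than word-level scores, which is consistent with the ranges reported above.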

More information

Item ID: 57522
DC Identifier: https://oa.upm.es/57522/
OAI Identifier: oai:oa.upm.es:57522
Deposited by: Biblioteca Facultad de Informatica
Deposited on: 16 Dec 2019 08:59
Last Modified: 16 Dec 2019 08:59