Text classification of cyberbullying comments: a study on the applicability of various BERT models

Gong, Seong-Min (2023). Text classification of cyberbullying comments: a study on the applicability of various BERT models. Thesis (Master thesis), E.T.S. de Ingenieros Informáticos (UPM).

Description

Title: Text classification of cyberbullying comments: a study on the applicability of various BERT models
Author/s:
  • Gong, Seong-Min
Contributor/s:
Item Type: Thesis (Master thesis)
Masters title: Ciencia de Datos
Date: June 2023
Subjects:
Freetext Keywords: NLP, BERT, RoBERTa, DeBERTa, Cyberbullying Detection, Text Classification.
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Inteligencia Artificial
Creative Commons Licenses: Attribution - No derivative works - Non commercial

Full text

TFM_SEONG-MIN_GONG.pdf (PDF, 1MB)

Abstract

Despite its significant advantages, the contemporary digital landscape also gives rise to considerable societal problems, one of which is cyberbullying. Characterized by attacks or derogatory comments made under the cloak of online anonymity, this behavior causes psychological distress in victims and degrades the overall health of digital communication environments. In light of this, there is a growing demand for solutions employing Natural Language Processing (NLP) technologies.

In response, this study explores the application of NLP technologies, tools that enable computers to comprehend and process human language, to the detection of cyberbullying. Three pre-trained models, BERT, RoBERTa and DeBERTa, which are proficient in deciphering context and inferring connotations from textual data, were used in this investigation.

A dataset comprising cyberbullying comments extracted from digital platforms such as Twitter and Kaggle was compiled. The models were evaluated to determine their efficacy in detecting malicious comments, which also revealed areas for improvement in the fine-tuning process. The results showed that all three models achieved an accuracy of over 93%. In particular, the BERT Multilingual model reached an accuracy of 94.1%. During the transfer learning process, however, only the BERT models maintained relatively high accuracy (Base: 84.9%, Large: 86.6%, Multilingual: 85.9%), whereas the RoBERTa and DeBERTa models exhibited lower accuracy, in the range of 63%.

Based on the findings, a need for additional research into the optimization of fine-tuning strategies and model development reflecting linguistic diversity was identified. By pursuing these research directions, the performance of text classification models intended for cyberbullying comment prevention can be greatly enhanced. This approach serves as a crucial step towards managing cyberbullying issues more effectively and minimizing the harm they cause, thereby fostering healthier communication within online communities.
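
The accuracy figures quoted in the abstract are most naturally read as the fraction of comments whose predicted label matches the reference label. The record does not describe the thesis's evaluation code, so the following is only a minimal sketch of that metric; the function name `accuracy` and the toy labels (1 = cyberbullying, 0 = not cyberbullying) are hypothetical and are not taken from the thesis.

```python
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Return the fraction of predictions that match the reference labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("cannot compute accuracy on an empty label set")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


# Hypothetical toy data, NOT results from the thesis:
# 1 = cyberbullying comment, 0 = non-cyberbullying comment.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
pred = [1, 0, 0, 1, 0, 1, 1, 0]

print(f"accuracy = {accuracy(gold, pred):.3f}")
```

In practice the predicted labels would come from a fine-tuned classifier evaluated on a held-out test set; the metric itself is independent of which model produced them.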

More information

Item ID: 75836
DC Identifier: https://oa.upm.es/75836/
OAI Identifier: oai:oa.upm.es:75836
Deposited by: Biblioteca Facultad de Informatica
Deposited on: 13 Sep 2023 10:24
Last Modified: 13 Sep 2023 10:24