BERTuit: Understanding Spanish language in Twitter with transformers

Huertas Tato, Javier ORCID: https://orcid.org/0000-0003-4127-5505, Martín García, Alejandro ORCID: https://orcid.org/0000-0002-0800-7632 and Camacho Fernández, David ORCID: https://orcid.org/0000-0002-5051-3475 (2023). BERTuit: Understanding Spanish language in Twitter with transformers. "Expert Systems", v. 40 (n. 9); ISSN 1468-0394. https://doi.org/10.1111/exsy.13404.

Description

Title: BERTuit: Understanding Spanish language in Twitter with transformers
Author(s): Huertas Tato, Javier; Martín García, Alejandro; Camacho Fernández, David
Document Type: Article
Journal/Publication Title: Expert Systems
Date: November 2023
ISSN: 1468-0394
Volume: 40
Issue: 9
Subjects:
SDGs:
Informal Keywords: misinformation, online social networks, transformers, Twitter
School: E.T.S.I. de Sistemas Informáticos (UPM)
Department: Sistemas Informáticos
Creative Commons Licence: Attribution - NoDerivatives - NonCommercial

Full Text

PDF (Expert Systems - 2023 - Huertas-Tato - BERTuit: Understanding Spanish language in Twitter with transformers.pdf) - A PDF viewer such as GSview, Xpdf or Adobe Acrobat Reader is required
Download (2MB)

Abstract

The appearance of complex attention-based language models such as BERT, RoBERTa or GPT-3 has made it possible to address highly complex tasks in a plethora of scenarios. However, when applied to specific domains, these models encounter considerable difficulties. This is the case for social networks such as Twitter, an ever-changing stream of information written in informal and complex language, where each message requires careful evaluation to be understood even by humans, given the important role that context plays. Addressing tasks in this domain through Natural Language Processing involves severe challenges. When powerful state-of-the-art multilingual language models are applied to this scenario, language-specific nuances get lost in translation. To face these challenges we present BERTuit, the largest transformer proposed so far for the Spanish language, pre-trained on a massive dataset of 230M Spanish tweets using RoBERTa optimization. Our motivation is to provide a powerful resource for better understanding Spanish Twitter, to be used in applications focused on this social network, with special emphasis on solutions devoted to tackling the spread of misinformation on this platform. BERTuit is evaluated on several tasks and compared against M-BERT, XLM-RoBERTa and XLM-T, very competitive multilingual transformers. The utility of our approach is shown with two applications: an unsupervised methodology to visualize groups of hoaxes, and supervised profiling of authors spreading disinformation.

Associated Projects

Type: Government of Spain
Code: PID2020-117263GB-100
Acronym: FightDIS
Principal Investigator: Not specified
Title: Fighting against Information DISorders in Online Social Networks

Type: Comunidad de Madrid
Code: S2018/TCS-4566
Acronym: CYNAMON-CM
Principal Investigator: Not specified
Title: Cybersecurity, Network Analysis and Monitoring for the Next Generation Internet

Type: Horizon Europe
Code: 2020-EU-IA-0252:29374659
Acronym: IBERIFIER
Principal Investigator: Not specified
Title: Iberian Digital Media Research and Fact-Checking Hub

Type: Government of Spain
Code: PLEC2021-007681
Acronym: XAI-Disinfodemics
Principal Investigator: Not specified
Title: eXplainable AI for disinformation and conspiracy detection during infodemics

More Information

Record ID: 88862
DC Identifier: https://oa.upm.es/88862/
OAI Identifier: oai:oa.upm.es:88862
Scientific Portal URL: https://portalcientifico.upm.es/es/ipublic/item/10090880
DOI: 10.1111/exsy.13404
Official URL: https://onlinelibrary.wiley.com/doi/10.1111/exsy.1...
Deposited by: iMarina Portal Científico
Deposited on: 30 Apr 2025 17:40
Last Modified: 30 Apr 2025 17:40