Transformer-based multistage architectures for code search

González López, Ángel Luis (2021). Transformer-based multistage architectures for code search. Thesis (Master thesis), E.T.S. de Ingenieros Informáticos (UPM).


Title: Transformer-based multistage architectures for code search
  • González López, Ángel Luis
  • Payberah, Amir
  • Peña, Francisco J.
  • Pashami, Sepideh
  • Al-Shishtawy, Ahmad
Item Type: Thesis (Master thesis)
Master's title: Digital Innovation: Data Science
Date: September 2021
Freetext Keywords: Code search, Natural Language Processing, BERT, Information retrieval
Faculty: E.T.S. de Ingenieros Informáticos (UPM)
Department: Other
Creative Commons License: Attribution - NonCommercial - NoDerivatives (CC BY-NC-ND)

Full text

TFM_ANGEL_LUIS_GONZALEZ_LOPEZ.pdf (PDF, 1 MB)


Code search is one of the most common tasks for developers. The open-source software movement and the rise of social media have made this process easier, thanks to the vast public software repositories available to everyone and the Q&A sites where developers can resolve their doubts. However, searching for code snippets becomes very complicated when code is poorly documented and hard to find in a repository, or when a private enterprise framework is not publicly available and therefore has no community on Q&A sites to answer questions. To address this problem, this thesis studies the use of natural language for code retrieval. In particular, it studies transformer-based models such as Bidirectional Encoder Representations from Transformers (BERT), which are currently state of the art in natural language processing but exhibit high latency in information retrieval tasks. This project therefore proposes a multi-stage architecture that seeks to maintain the performance of standard BERT-based models while reducing the high latency usually associated with this type of framework. Experiments show that this architecture outperforms previous non-BERT-based models by +0.17 on the Top-1 (Recall@1) metric and reduces latency, with inference times 5% of those of standard BERT models.
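The multi-stage idea in the abstract can be sketched as a retrieve-then-rerank pipeline: a cheap first stage scans the whole corpus and keeps only a handful of candidates, and an expensive second stage (a stand-in here for a BERT cross-encoder) reranks only those. This is a minimal illustrative sketch, not the thesis's actual implementation; the function names and the toy lexical scoring are assumptions.

```python
# Hypothetical two-stage retrieve-then-rerank sketch for code search.
# Stage 1 is a fast lexical scorer run over the full corpus; stage 2 is a
# slow, more accurate scorer (in the thesis, a BERT-based model) run only
# on the top-k candidates, which is what keeps overall latency low.
import re
from collections import Counter

def tokens(text: str) -> list[str]:
    # Split on non-alphanumerics so identifiers like read_file
    # yield the tokens "read" and "file".
    return re.findall(r"[a-zA-Z0-9]+", text.lower())

def cheap_score(query: str, snippet: str) -> float:
    """Stage 1: fast bag-of-words overlap between query and code."""
    q, s = Counter(tokens(query)), Counter(tokens(snippet))
    return float(sum((q & s).values()))

def expensive_rerank(query: str, snippet: str) -> float:
    """Stage 2 stand-in for a BERT cross-encoder: overlap with a crude
    length penalty. In a real system this is the slow, accurate model."""
    diff = abs(len(tokens(query)) - len(tokens(snippet)))
    return cheap_score(query, snippet) / (1 + 0.1 * diff)

def search(query: str, corpus: list[str], k: int = 3) -> list[str]:
    # Stage 1: score every snippet cheaply, keep the top-k candidates.
    candidates = sorted(corpus, key=lambda s: cheap_score(query, s),
                        reverse=True)[:k]
    # Stage 2: rerank only k candidates, so the expensive model runs
    # O(k) times instead of O(len(corpus)) times.
    return sorted(candidates, key=lambda s: expensive_rerank(query, s),
                  reverse=True)

corpus = [
    "def read_file(path): return open(path).read()",
    "def sort_items(items): return sorted(items)",
    "def parse_json(text): return json.loads(text)",
]
print(search("read a file", corpus, k=2))
```

The design point is that the first stage trades accuracy for speed to shrink the candidate set, so the second stage's latency no longer scales with corpus size; this is the standard way retrieve-then-rerank systems keep cross-encoder costs tractable.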

More information

Item ID: 70462
Deposited by: Biblioteca Facultad de Informatica
Deposited on: 11 May 2022 10:52
Last Modified: 11 May 2022 10:52